dubbing-ai/vaja-thai
Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/dubbing-ai/vaja-thai
---
language:
- th
license: other
license_name: mixed-per-source
task_categories:
- text-to-speech
- automatic-speech-recognition
tags:
- thai
- tts
- speech
- multi-speaker
size_categories:
- 100K<n<1M
---
# Vaja-Thai (วาจา) — Combined Thai TTS Dataset
A unified, quality-filtered Thai speech dataset combining multiple sources for
Text-to-Speech (TTS) research. All audio is resampled to **24 kHz** WAV format.
## Dataset Summary
| Metric | Value |
|--------|-------|
| Total samples | 337,444 |
| Total hours | 647.4h |
| Sampling rate | 24,000 Hz |
| Format | WAV 16-bit PCM |
| Language | Thai (ภาษาไทย) |
## Sources
| Source | Samples | Hours | License | Description |
|--------|---------|-------|---------|-------------|
| tsync2 | 2,686 | 5.5h | CC-BY-NC-SA-3.0 | NECTEC professional TTS corpus, single female speaker |
| porjai_central | 218,076 | 495.5h | CC-BY-SA-4.0 | CMKL crowdsourced Central Thai speech |
| gigaspeech2 | 14,762 | 19.6h | non-commercial-research-only | GigaSpeech2 Thai dev+test (human-annotated) |
| commonvoice | 101,920 | 126.8h | CC-0 | Mozilla Common Voice Thai (validated split) |
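As a quick consistency check, the per-source rows can be summed and compared against the Dataset Summary. The sketch below copies its figures from the Sources table. The uncompressed-size estimate also assumes mono audio, which this card does not state.

```python
# Per-source (samples, hours), copied from the Sources table.
SOURCES = {
    "tsync2": (2_686, 5.5),
    "porjai_central": (218_076, 495.5),
    "gigaspeech2": (14_762, 19.6),
    "commonvoice": (101_920, 126.8),
}

total_samples = sum(n for n, _ in SOURCES.values())
total_hours = sum(h for _, h in SOURCES.values())


def mean_clip_seconds(samples: int, hours: float) -> float:
    """Average clip length in seconds for a given sample count and duration."""
    return hours * 3600.0 / samples


# Assumption: mono audio (not stated in the card); 16-bit PCM is 2 bytes per sample.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2
approx_pcm_gb = total_hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE / 1e9

print(f"{total_samples:,} samples, {total_hours:.1f} h")
for name, (n, h) in SOURCES.items():
    print(f"{name}: mean clip {mean_clip_seconds(n, h):.2f} s")
print(f"approx. uncompressed PCM size: {approx_pcm_gb:.0f} GB")
```

The totals agree with the Dataset Summary (337,444 samples, 647.4 h), and the overall mean clip length works out to about 6.9 seconds.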
## Loading the Dataset
```python
from datasets import concatenate_datasets, load_dataset

# Load a specific source
ds = load_dataset("dubbing-ai/vaja-thai", "tsync2")
ds = load_dataset("dubbing-ai/vaja-thai", "porjai_central")
ds = load_dataset("dubbing-ai/vaja-thai", "gigaspeech2")
ds = load_dataset("dubbing-ai/vaja-thai", "commonvoice")

# Streaming mode (no full download needed)
ds = load_dataset("dubbing-ai/vaja-thai", "porjai_central", streaming=True)
for sample in ds["train"].take(10):
    print(sample["text"])

# Combine all sources
ds_all = concatenate_datasets([
    load_dataset("dubbing-ai/vaja-thai", c, split="train")
    for c in ["tsync2", "porjai_central", "gigaspeech2", "commonvoice"]
])
```
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique sample ID (`{source}_{original_id}`) |
| `audio` | Audio(24000) | Audio waveform |
| `text` | string | Thai transcription |
| `source` | string | Origin dataset name |
| `speaker_id` | string | Speaker identifier |
| `speaker_gender` | string | Speaker gender (`male` or `female`), or None if unknown |
| `duration_s` | float | Duration in seconds |
| `original_sr` | int | Original sampling rate before resampling |
| `quality_tier` | int | 1–4 refined quality tier (see below) |
| `snr_db` | float | Estimated Signal-to-Noise Ratio in dB |
| `whisper_cer` | float | Character Error Rate from Whisper validation (None if skipped) |
| `license` | string | License of the source dataset |
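One practical detail of the `{source}_{original_id}` pattern: some source names contain underscores themselves (`porjai_central`), so splitting an ID on the first underscore would give the wrong source. The sketch below matches against the known source names instead. It assumes the ID prefix is exactly the config name, and the example IDs in the usage lines are hypothetical. When the `source` column is available, reading it directly is simpler.

```python
KNOWN_SOURCES = ("tsync2", "porjai_central", "gigaspeech2", "commonvoice")


def split_sample_id(sample_id: str) -> tuple[str, str]:
    """Split '{source}_{original_id}' into (source, original_id).

    Assumption: the prefix is one of KNOWN_SOURCES followed by an underscore.
    Longer names are tried first so a source name containing '_' stays intact.
    """
    for source in sorted(KNOWN_SOURCES, key=len, reverse=True):
        prefix = source + "_"
        if sample_id.startswith(prefix):
            return source, sample_id[len(prefix):]
    raise ValueError(f"unrecognized source prefix in id: {sample_id!r}")


# Hypothetical IDs, for illustration only.
print(split_sample_id("porjai_central_utt_0001"))
print(split_sample_id("tsync2_0042"))
```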
## Quality Filtering
- **Whisper validation**: Samples from Porjai and Common Voice were transcribed with
`openai/whisper-large-v3-turbo` and filtered by Character Error Rate (CER ≤ 0.15).
TSync2 (studio quality) and GigaSpeech2 dev/test (human-annotated) were exempt.
- **Duration**: 1.0s – 30.0s
- **Audio energy**: Minimum RMS > -50 dBFS (removes near-silent clips)
- **Clipping**: < 1% clipped samples
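The checks above can be sketched in a few lines of NumPy. This is an illustration of the stated thresholds, not the pipeline actually used to build the dataset. It assumes float waveforms scaled to [-1, 1], an RMS measured over the whole clip, a clipping threshold of 0.999 of full scale, and a CER computed as character-level edit distance divided by reference length. All function names are illustrative.

```python
import numpy as np

MIN_DURATION_S, MAX_DURATION_S = 1.0, 30.0
MIN_RMS_DBFS = -50.0
MAX_CLIPPED_FRACTION = 0.01
MAX_CER = 0.15


def rms_dbfs(wave: np.ndarray) -> float:
    """Whole-clip RMS level in dBFS for a float waveform in [-1, 1]."""
    rms = np.sqrt(np.mean(wave.astype(np.float64) ** 2))
    return float(20.0 * np.log10(max(rms, 1e-12)))  # floor avoids log10(0)


def clipped_fraction(wave: np.ndarray, threshold: float = 0.999) -> float:
    """Fraction of samples at or above the (assumed) clipping threshold."""
    return float(np.mean(np.abs(wave) >= threshold))


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over code points, divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1] / len(reference)


def passes_basic_filters(wave: np.ndarray, sample_rate: int, cer=None) -> bool:
    """Apply the duration, energy, clipping and (optional) CER thresholds."""
    duration_s = len(wave) / sample_rate
    return (
        MIN_DURATION_S <= duration_s <= MAX_DURATION_S
        and rms_dbfs(wave) > MIN_RMS_DBFS
        and clipped_fraction(wave) < MAX_CLIPPED_FRACTION
        and (cer is None or cer <= MAX_CER)  # None: Whisper validation skipped
    )


# Usage: a 2-second, half-scale tone passes; half a second of it does not.
sr = 24_000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)
print(passes_basic_filters(tone, sr), passes_basic_filters(tone[: sr // 2], sr))
```

Note that this CER counts Unicode code points, so Thai vowel and tone marks are scored as separate characters. Whether the original validation normalized text differently is not stated.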
## Upsampling
- Sources at 16 kHz (Porjai, GigaSpeech2) were upsampled using **AP-BWE**
(IEEE/ACM Trans. ASLP 2024), a GAN-based bandwidth extension model with dual-stream
amplitude-phase prediction that runs at 292x real time on GPU.
- TSync2 (22.05 kHz) was resampled with `librosa` (`kaiser_best`).
- Common Voice (48 kHz MP3) was decoded and downsampled with `librosa`.
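The rate conversions above reduce to simple rational ratios, which makes it easy to see which sources gain samples and which lose them. A small standard-library sketch (the helper name is illustrative and not part of any pipeline described here):

```python
from fractions import Fraction

TARGET_SR = 24_000


def resample_ratio(original_sr: int, target_sr: int = TARGET_SR) -> Fraction:
    """Output/input sample-count ratio, reduced to lowest terms."""
    return Fraction(target_sr, original_sr)


# Original sampling rates as listed in the Upsampling section.
for name, sr in [("porjai / gigaspeech2", 16_000), ("tsync2", 22_050), ("commonvoice", 48_000)]:
    ratio = resample_ratio(sr)
    direction = "up" if ratio > 1 else "down"
    print(f"{name}: {sr} Hz -> {TARGET_SR} Hz, ratio {ratio} ({direction}sampling)")
```

A ratio above 1 only adds samples. Plain resampling cannot add content above the original Nyquist frequency (8 kHz for 16 kHz audio), which is the gap a bandwidth extension model is meant to fill.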
## Quality Tiers
Each sample has a `quality_tier` column (1–4) assigned based on **both source provenance
and measured audio quality** (CER + SNR). This ensures noisy ASR-origin samples don't
pollute TTS training, while clean ASR samples can still be promoted.
| Tier | Criteria | Description | Use case |
|------|----------|-------------|----------|
| **1** | Studio/human-annotated, OR ASR with CER ≤ 0.03 + SNR ≥ 25 dB | Highest quality | Fine-tuning, high-quality single/few-speaker TTS |
| **2** | CER ≤ 0.08 + SNR ≥ 15 dB | Clean ASR samples | Multi-speaker TTS with verified transcriptions |
| **3** | CER ≤ 0.15 + SNR ≥ 10 dB | Acceptable quality | Pre-training, data augmentation |
| **4** | Passes basic filters but lower measured quality | Marginal | Large-scale pre-training only, use with caution |
**Base assignments** (before refinement):
- TSync2, GigaSpeech2 dev/test → always Tier 1 (studio/human-annotated)
- Common Voice, Porjai Central → refined by CER + SNR measurements
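Read top to bottom, the tier table amounts to a small decision function. The sketch below is one reading of it, not the code that produced the `quality_tier` column. It assumes the `source` values match the config names, and that a refined source with a missing CER or SNR falls to tier 4. The card does not specify that last case.

```python
from typing import Optional

# Sources that are always tier 1 (studio or human-annotated).
ALWAYS_TIER_1 = {"tsync2", "gigaspeech2"}


def assign_quality_tier(source: str, cer: Optional[float], snr_db: Optional[float]) -> int:
    """Map source provenance plus measured CER/SNR to a tier from 1 to 4."""
    if source in ALWAYS_TIER_1:
        return 1
    if cer is None or snr_db is None:
        return 4  # assumption: no measurement means no promotion
    if cer <= 0.03 and snr_db >= 25:
        return 1
    if cer <= 0.08 and snr_db >= 15:
        return 2
    if cer <= 0.15 and snr_db >= 10:
        return 3
    return 4


# Usage: a clean Common Voice clip is promoted, a noisy Porjai clip is not.
print(assign_quality_tier("commonvoice", cer=0.02, snr_db=30.0))
print(assign_quality_tier("porjai_central", cer=0.10, snr_db=5.0))
```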
Example — train only on tier 1+2 (recommended for TTS):
```python
from datasets import load_dataset

ds = load_dataset("dubbing-ai/vaja-thai", "all")
ds_high_quality = ds.filter(lambda x: x["quality_tier"] <= 2)
```
Example — filter by SNR directly:
```python
ds_clean = ds.filter(lambda x: x["snr_db"] >= 20)
```
## Speaker Labels
- **tsync2**: Single known professional female speaker (`tsync2_nun`)
- **porjai_central**: No speaker labels available (`porjai_central_unknown`)
- **gigaspeech2**: YouTube channel ID used as speaker proxy
- **commonvoice**: `client_id` hash used as speaker proxy, with optional gender metadata
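For multi-speaker training it can help to separate real speaker labels from placeholders. The sketch below treats any `speaker_id` ending in `_unknown` as a placeholder. That is an assumption generalized from `porjai_central_unknown`, and the example rows are hypothetical. It works on plain dictionaries, so it applies equally to rows iterated from a loaded dataset.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def known_speaker(sample: Mapping) -> Optional[str]:
    """Return the speaker ID, or None for missing or placeholder labels."""
    speaker_id = sample.get("speaker_id")
    if not speaker_id or speaker_id.endswith("_unknown"):  # assumed placeholder suffix
        return None
    return speaker_id


def utterances_per_speaker(samples: Iterable[Mapping]) -> Counter:
    """Count utterances per labeled speaker, skipping placeholder labels."""
    return Counter(s for s in map(known_speaker, samples) if s is not None)


# Hypothetical rows, for illustration only.
rows = [
    {"speaker_id": "tsync2_nun"},
    {"speaker_id": "tsync2_nun"},
    {"speaker_id": "porjai_central_unknown"},
    {"speaker_id": "commonvoice_ab12"},
]
print(utterances_per_speaker(rows))
```

Because the GigaSpeech2 and Common Voice labels are proxies (channel ID and `client_id` hash), a single label may not correspond to exactly one voice.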
## License
Each config has its own license. The `all` config inherits the most restrictive terms
(**non-commercial**), but individual configs may be more permissive:
| Config | License | Commercial use |
|--------|---------|----------------|
| `tsync2` | CC-BY-NC-SA 3.0 | No |
| `porjai_central` | CC-BY-SA 4.0 | **Yes** |
| `gigaspeech2` | Non-commercial research/education only | No |
| `commonvoice` | CC-0 (public domain) | **Yes** |
Check the `license` column in each sample for per-sample license info.
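The table above can be encoded as a lookup for selecting only the configs whose license permits commercial use. This is a minimal sketch that mirrors the table. It assumes the `source` values match the config names, and it treats any unlisted source as not permitted.

```python
# Commercial-use flags, copied from the License table.
COMMERCIAL_USE_ALLOWED = {
    "tsync2": False,          # CC-BY-NC-SA 3.0
    "porjai_central": True,   # CC-BY-SA 4.0
    "gigaspeech2": False,     # non-commercial research/education only
    "commonvoice": True,      # CC-0
}


def allows_commercial_use(source: str) -> bool:
    """True only for sources the table marks as commercially usable."""
    return COMMERCIAL_USE_ALLOWED.get(source, False)  # unknown source: be conservative


def commercially_usable_configs() -> list[str]:
    """Config names that could be loaded for a commercial project."""
    return [name for name, ok in COMMERCIAL_USE_ALLOWED.items() if ok]


print(commercially_usable_configs())
```

With a loaded dataset, the same predicate can be passed to `filter` together with `input_columns=["source"]`, so that only the `source` column is handed to the function. The CC-BY-SA 4.0 attribution and share-alike terms still apply to `porjai_central`.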
## Citation
If you use this dataset, please cite the original source datasets:
```bibtex
@inproceedings{ardila-etal-2020-common,
  title = "Common Voice: A Massively-Multilingual Speech Corpus",
  author = "Ardila, Rosana and Branson, Megan and Davis, Kelly and Kohler, Michael
    and Meyer, Josh and Henretty, Michael and Morais, Reuben and Saunders, Lindsay
    and Tyers, Francis and Weber, Gregor",
  booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
  year = "2020",
  publisher = "European Language Resources Association",
  url = "https://aclanthology.org/2020.lrec-1.520/",
  pages = "4218--4222"
}

@inproceedings{suwanbandit23_interspeech,
  title = "Thai Dialect Corpus and Transfer-based Curriculum Learning
    Investigation for Dialect Automatic Speech Recognition",
  author = "Suwanbandit, Artit and Naowarat, Burin and Sangpetch, Orathai
    and Chuangsuwanich, Ekapol",
  booktitle = "Interspeech 2023",
  year = "2023",
  pages = "4069--4073",
  doi = "10.21437/Interspeech.2023-1828"
}

@article{gigaspeech2,
  title = "GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus
    for Low-Resource Languages with Automated Crawling, Transcription and Refinement",
  author = "Yang, Yifan and Song, Zheshu and Zhuo, Jianheng and Cui, Mingyu
    and Li, Jinpeng and Yang, Bo and Du, Yexing and Ma, Ziyang
    and Liu, Xunying and Wang, Ziyuan and Li, Ke and Fan, Shuai
    and Yu, Kai and Zhang, Wei-Qiang and Chen, Guoguo and Chen, Xie",
  journal = "arXiv preprint arXiv:2406.11546",
  year = "2024"
}

@inproceedings{wutiwiwatchai2007tsync,
  title = "An Intensive Design of a Thai Speech Synthesis Corpus",
  author = "Wutiwiwatchai, Chai and Saychum, Sudaporn and Rugchatjaroen, Anocha",
  booktitle = "International Symposium on Natural Language Processing (SNLP 2007)",
  year = "2007"
}
```
Provider: dubbing-ai



