dubbing-ai/vaja-thai

Name: dubbing-ai/vaja-thai
Creator: dubbing-ai
Published: 2026-03-28 09:16:43
License: (no description provided)

Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/dubbing-ai/vaja-thai

Resource description:

---
language:
- th
license: other
license_name: mixed-per-source
task_categories:
- text-to-speech
- automatic-speech-recognition
tags:
- thai
- tts
- speech
- multi-speaker
size_categories:
- 100K<n<1M
---

# Vaja-Thai (วาจา): Combined Thai TTS Dataset

A unified, quality-filtered Thai speech dataset combining multiple sources for Text-to-Speech (TTS) research. All audio is resampled to **24 kHz** WAV format.

## Dataset Summary

| Metric | Value |
|--------|-------|
| Total samples | 337,444 |
| Total hours | 647.4h |
| Sampling rate | 24,000 Hz |
| Format | WAV 16-bit PCM |
| Language | Thai (ภาษาไทย) |

## Sources

| Source | Samples | Hours | License | Description |
|--------|---------|-------|---------|-------------|
| tsync2 | 2,686 | 5.5h | CC-BY-NC-SA-3.0 | NECTEC professional TTS corpus, single female speaker |
| porjai_central | 218,076 | 495.5h | CC-BY-SA-4.0 | CMKL crowdsourced Central Thai speech |
| gigaspeech2 | 14,762 | 19.6h | non-commercial-research-only | GigaSpeech2 Thai dev+test (human-annotated) |
| commonvoice | 101,920 | 126.8h | CC-0 | Mozilla Common Voice Thai (validated split) |

## Loading the Dataset

```python
from datasets import load_dataset

# Load a specific source
ds = load_dataset("dubbing-ai/vaja-thai", "tsync2")
ds = load_dataset("dubbing-ai/vaja-thai", "porjai_central")
ds = load_dataset("dubbing-ai/vaja-thai", "gigaspeech2")
ds = load_dataset("dubbing-ai/vaja-thai", "commonvoice")

# Streaming mode (no full download needed)
ds = load_dataset("dubbing-ai/vaja-thai", "porjai_central", streaming=True)
for sample in ds["train"].take(10):
    print(sample["text"])

# Combine all sources
from datasets import concatenate_datasets

ds_all = concatenate_datasets([
    load_dataset("dubbing-ai/vaja-thai", c, split="train")
    for c in ["tsync2", "porjai_central", "gigaspeech2", "commonvoice"]
])
```

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique sample ID (`{source}_{original_id}`) |
| `audio` | Audio(24000) | Audio waveform |
| `text` | string | Thai transcription |
| `source` | string | Origin dataset name |
| `speaker_id` | string | Speaker identifier |
| `speaker_gender` | string | Gender if known (male/female/None) |
| `duration_s` | float | Duration in seconds |
| `original_sr` | int | Original sampling rate before resampling |
| `quality_tier` | int | 1–4 refined quality tier (see below) |
| `snr_db` | float | Estimated Signal-to-Noise Ratio in dB |
| `whisper_cer` | float | Character Error Rate from Whisper validation (None if skipped) |
| `license` | string | License of the source dataset |

## Quality Filtering

- **Whisper validation**: Samples from Porjai and Common Voice were transcribed with `openai/whisper-large-v3-turbo` and filtered by Character Error Rate (CER ≤ 0.15). TSync2 (studio quality) and GigaSpeech2 dev/test (human-annotated) were exempt.
- **Duration**: 1.0s – 30.0s
- **Audio energy**: Minimum RMS > -50 dBFS (removes near-silent clips)
- **Clipping**: < 1% clipped samples

## Upsampling

- Sources at 16 kHz (Porjai, GigaSpeech2) were upsampled using **AP-BWE** (IEEE/ACM Trans. ASLP 2024), a GAN-based bandwidth extension model with dual-stream amplitude-phase prediction. 292x real-time on GPU.
- TSync2 (22.05 kHz) was resampled with `librosa` kaiser_best.
- Common Voice (48 kHz MP3) was decoded and downsampled with `librosa`.

## Quality Tiers

Each sample has a `quality_tier` column (1–4) assigned based on **both source provenance and measured audio quality** (CER + SNR). This ensures noisy ASR-origin samples don't pollute TTS training, while clean ASR samples can still be promoted.
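As a rough illustration, the tier criteria listed in this section can be read as a threshold cascade. The sketch below is not the dataset authors' code: the function name `assign_quality_tier`, the `ALWAYS_TIER1_SOURCES` set, and the handling of a missing CER are assumptions made for the example. Only the thresholds and the base assignments are taken from this card.

```python
from typing import Optional

# Sources the card assigns to Tier 1 regardless of measurements
# (studio or human-annotated). The set name is hypothetical.
ALWAYS_TIER1_SOURCES = {"tsync2", "gigaspeech2"}


def assign_quality_tier(source: str, cer: Optional[float], snr_db: float) -> int:
    """Return a tier from 1 (best) to 4 using the thresholds in the tier table."""
    if source in ALWAYS_TIER1_SOURCES:
        return 1
    if cer is None:
        # Assumption: an ASR-origin sample without a CER cannot be promoted.
        return 4
    if cer <= 0.03 and snr_db >= 25:
        return 1
    if cer <= 0.08 and snr_db >= 15:
        return 2
    if cer <= 0.15 and snr_db >= 10:
        return 3
    return 4
```

Under these assumptions, `assign_quality_tier("commonvoice", 0.02, 20.0)` returns 2: the CER is low enough for Tier 1, but the SNR misses the 25 dB bar.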

| Tier | Criteria | Description | Use case |
|------|----------|-------------|----------|
| **1** | Studio/human-annotated, OR ASR with CER ≤ 0.03 + SNR ≥ 25 dB | Highest quality | Fine-tuning, high-quality single/few-speaker TTS |
| **2** | CER ≤ 0.08 + SNR ≥ 15 dB | Clean ASR samples | Multi-speaker TTS with verified transcriptions |
| **3** | CER ≤ 0.15 + SNR ≥ 10 dB | Acceptable quality | Pre-training, data augmentation |
| **4** | Passes basic filters but lower measured quality | Marginal | Large-scale pre-training only, use with caution |

**Base assignments** (before refinement):

- TSync2, GigaSpeech2 dev/test → always Tier 1 (studio/human-annotated)
- Common Voice, Porjai Central → refined by CER + SNR measurements

Example, training only on tiers 1 and 2 (recommended for TTS):

```python
from datasets import load_dataset

# "all" is the combined config; filter() is applied to every split it contains
ds = load_dataset("dubbing-ai/vaja-thai", "all")
ds_high_quality = ds.filter(lambda x: x["quality_tier"] <= 2)
```

Example, filtering by SNR directly:

```python
# Reuses `ds` from the previous example
ds_clean = ds.filter(lambda x: x["snr_db"] >= 20)
```

## Speaker Labels

- **tsync2**: Single known professional female speaker (`tsync2_nun`)
- **porjai_central**: No speaker labels available (`porjai_central_unknown`)
- **gigaspeech2**: YouTube channel ID used as speaker proxy
- **commonvoice**: `client_id` hash used as speaker proxy, with optional gender metadata

## License

Each config has its own license. The `all` config inherits the most restrictive terms (**non-commercial**), but individual configs may be more permissive:

| Config | License | Commercial use |
|--------|---------|----------------|
| `tsync2` | CC-BY-NC-SA 3.0 | No |
| `porjai_central` | CC-BY-SA 4.0 | **Yes** |
| `gigaspeech2` | Non-commercial research/education only | No |
| `commonvoice` | CC-0 (public domain) | **Yes** |

Check the `license` column in each sample for per-sample license info.
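
One way to act on the license table is to load only the configs marked as permitting commercial use instead of the `all` config. The following is a minimal sketch, not part of the dataset's tooling: the `COMMERCIAL_USE` mapping and both helper functions are hypothetical names, and the flags are simply copied from the table above. Note that CC-BY-SA 4.0 still requires attribution and share-alike, so this is not a substitute for reading the licenses.

```python
# Commercial-use flags copied from the license table in this card.
# The mapping name and helpers below are illustrative, not a dataset API.
COMMERCIAL_USE = {
    "tsync2": False,
    "porjai_central": True,
    "gigaspeech2": False,
    "commonvoice": True,
}


def commercial_configs():
    """Return the config names whose license allows commercial use."""
    return [name for name, allowed in COMMERCIAL_USE.items() if allowed]


def load_commercial_subset():
    """Download and concatenate the train split of each permitted config."""
    # Imported here so the helpers above work without `datasets` installed.
    from datasets import concatenate_datasets, load_dataset

    parts = [
        load_dataset("dubbing-ai/vaja-thai", name, split="train")
        for name in commercial_configs()
    ]
    return concatenate_datasets(parts)
```

Calling `load_commercial_subset()` triggers a full download of the selected configs. The per-sample `license` column remains the place to confirm what applies to each row.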

## Citation

If you use this dataset, please cite the original source datasets:

```bibtex
@inproceedings{ardila-etal-2020-common,
  title = "Common Voice: A Massively-Multilingual Speech Corpus",
  author = "Ardila, Rosana and Branson, Megan and Davis, Kelly and Kohler, Michael and Meyer, Josh and Henretty, Michael and Morais, Reuben and Saunders, Lindsay and Tyers, Francis and Weber, Gregor",
  booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
  year = "2020",
  publisher = "European Language Resources Association",
  url = "https://aclanthology.org/2020.lrec-1.520/",
  pages = "4218--4222"
}

@inproceedings{suwanbandit23_interspeech,
  title = "Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition",
  author = "Suwanbandit, Artit and Naowarat, Burin and Sangpetch, Orathai and Chuangsuwanich, Ekapol",
  booktitle = "Interspeech 2023",
  year = "2023",
  pages = "4069--4073",
  doi = "10.21437/Interspeech.2023-1828"
}

@article{gigaspeech2,
  title = "GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement",
  author = "Yang, Yifan and Song, Zheshu and Zhuo, Jianheng and Cui, Mingyu and Li, Jinpeng and Yang, Bo and Du, Yexing and Ma, Ziyang and Liu, Xunying and Wang, Ziyuan and Li, Ke and Fan, Shuai and Yu, Kai and Zhang, Wei-Qiang and Chen, Guoguo and Chen, Xie",
  journal = "arXiv preprint arXiv:2406.11546",
  year = "2024"
}

@inproceedings{wutiwiwatchai2007tsync,
  title = "An Intensive Design of a Thai Speech Synthesis Corpus",
  author = "Wutiwiwatchai, Chai and Saychum, Sudaporn and Rugchatjaroen, Anocha",
  booktitle = "International Symposium on Natural Language Processing (SNLP 2007)",
  year = "2007"
}
```

Provider:

dubbing-ai
