CMKL/Porjai-Thai-voice-dataset-pattani

Name: CMKL/Porjai-Thai-voice-dataset-pattani
Creator: CMKL
Published: 2024-09-03 20:22:45
License: No description available

Source: Hugging Face. Updated: 2024-09-03. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/CMKL/Porjai-Thai-voice-dataset-pattani

Resource description:

---
language:
- th
license: cc-by-nc-sa-4.0
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: sentence
    dtype: string
  - name: thai_sentence
    dtype: string
  - name: dialect_type
    dtype: string
  - name: utterance
    dtype: string
  splits:
  - name: train
    num_bytes: 464502838.88
    num_examples: 39361
  download_size: 445543336
  dataset_size: 464502838.88
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Porjai-Thai-voice-dataset-pattani

This corpus contains an official split of 700 hours for Central Thai and 40 hours for each of the three dialects. The corpus is designed so that some sentences are parallel across the dialects, making it suitable for speech and machine translation research. Our demo ASR model can be found at https://www.cmkl.ac.th/research/porjai.

The Central Thai data was collected using [Wang Data Market](https://www.wang.in.th/).

Since parts of this corpus are in the [ML-SUPERB](https://multilingual.superbbenchmark.org/) challenge, the test sets are not released in this GitHub repository and will be released later through ML-SUPERB.

The baseline models of our corpus are at:

- [Thai-central](https://huggingface.co/SLSCU/thai-dialect_thai-central_model)
- [Khummuang](https://huggingface.co/SLSCU/thai-dialect_khummuang_model)
- [Korat](https://huggingface.co/SLSCU/thai-dialect_korat_model)
- [Pattani](https://huggingface.co/SLSCU/thai-dialect_pattani_model)

The Thai-dialect Corpus is licensed under [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).

# Acknowledgements

This dataset was created with support from the PMU-C grant (Thai Language Automatic Speech Recognition Interface for Community E-Commerce, C10F630122) and compute support from the Apex cluster team. Some evaluation data was donated by Wang.

# Citation

```
@inproceedings{suwanbandit23_interspeech,
  author={Artit Suwanbandit and Burin Naowarat and Orathai Sangpetch and Ekapol Chuangsuwanich},
  title={{Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4069--4073},
  doi={10.21437/Interspeech.2023-1828}
}
```
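
As a rough illustration of how the `train` split described in the card header might be inspected, here is a minimal Python sketch. The three constants are copied from the card's `dataset_info` block. The helper names (`mean_bytes_per_example`, `dialect_counts`, `load_train_split`) are hypothetical and not part of the dataset or any library. `load_train_split` assumes that the `datasets` package is installed and that the repository ID from the page is reachable on the Hugging Face Hub. The labels in the demo are placeholders, since the card does not list the values stored in `dialect_type`.

```python
from collections import Counter

# Figures copied from the dataset_info block of the card.
TRAIN_NUM_BYTES = 464502838.88
TRAIN_NUM_EXAMPLES = 39361
DOWNLOAD_SIZE = 445543336


def mean_bytes_per_example(num_bytes, num_examples):
    """Average stored size of one example, in bytes."""
    return num_bytes / num_examples


def dialect_counts(labels):
    """Count how often each dialect label occurs in an iterable of strings."""
    return Counter(labels)


def load_train_split():
    """Load the train split from the Hub (assumption: network access and
    the `datasets` package are available; not exercised in the demo below)."""
    from datasets import load_dataset

    return load_dataset("CMKL/Porjai-Thai-voice-dataset-pattani", split="train")


if __name__ == "__main__":
    avg = mean_bytes_per_example(TRAIN_NUM_BYTES, TRAIN_NUM_EXAMPLES)
    print(f"mean bytes per example: {avg:.1f}")
    print(f"download size / dataset size: {DOWNLOAD_SIZE / TRAIN_NUM_BYTES:.3f}")
    # Placeholder labels for illustration only.
    print(dialect_counts(["label_a", "label_a", "label_b"]))
```

With the split loaded, `dialect_counts(ds["dialect_type"])` would tally the labels from the column alone, which avoids decoding the audio of every row.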

Provider:

CMKL
