CMKL/Porjai-Thai-voice-dataset-pattani
Hugging Face | Updated 2024-09-03 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/CMKL/Porjai-Thai-voice-dataset-pattani
Description:
---
language:
- th
license: cc-by-nc-sa-4.0
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: sentence
    dtype: string
  - name: thai_sentence
    dtype: string
  - name: dialect_type
    dtype: string
  - name: utterance
    dtype: string
  splits:
  - name: train
    num_bytes: 464502838.88
    num_examples: 39361
  download_size: 445543336
  dataset_size: 464502838.88
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Porjai-Thai-voice-dataset-pattani
This corpus contains an official split of 700 hours for Central Thai and 40 hours for each of the three dialects. The corpus is designed so that some sentences are parallel across the dialects, making it suitable for speech and machine translation research.
Our demo ASR model can be found at https://www.cmkl.ac.th/research/porjai. The Central Thai data was collected using [Wang Data Market](https://www.wang.in.th/).
Since parts of this corpus are used in the [ML-SUPERB](https://multilingual.superbbenchmark.org/) challenge, the test sets are not released in this GitHub repository; they will be released later through ML-SUPERB.
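As a minimal sketch of how the published data might be read: the metadata at the top of this card lists a single `train` split with 39,361 examples and five columns, so the snippet below loads that split with the Hugging Face `datasets` library and checks an example against those column names. The helper names (`load_pattani_train`, `missing_columns`, `text_fields`) are hypothetical and not part of the corpus or of any library; the loader assumes `datasets` is installed and that the repository is reachable.

```python
# Repository id and column names are taken from the card's metadata block.
REPO_ID = "CMKL/Porjai-Thai-voice-dataset-pattani"
EXPECTED_COLUMNS = ("audio", "sentence", "thai_sentence", "dialect_type", "utterance")


def load_pattani_train(streaming=True):
    """Return the train split, the only split listed in the metadata.

    Assumes `pip install datasets` and network access. With streaming=True,
    examples are fetched lazily instead of downloading roughly 445 MB first.
    """
    # Lazy import so the helpers below work without any extra packages.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)


def missing_columns(example):
    """List the expected columns that are absent from one example dict."""
    return [name for name in EXPECTED_COLUMNS if name not in example]


def text_fields(example):
    """Keep only the string-typed columns, dropping the audio payload."""
    return {
        name: example[name]
        for name in EXPECTED_COLUMNS
        if name != "audio" and name in example
    }
```

For instance, `next(iter(load_pattani_train()))` should yield one example as a dict, which can then be passed to `missing_columns` or `text_fields`. Decoding the `audio` column may additionally require an audio backend, depending on the installed `datasets` version.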
The baseline models for our corpus are available at:
- [Thai-central](https://huggingface.co/SLSCU/thai-dialect_thai-central_model)
- [Khummuang](https://huggingface.co/SLSCU/thai-dialect_khummuang_model)
- [Korat](https://huggingface.co/SLSCU/thai-dialect_korat_model)
- [Pattani](https://huggingface.co/SLSCU/thai-dialect_pattani_model)
The Thai-dialect Corpus is licensed under [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
# Acknowledgements
This dataset was created with support from the PMU-C grant (Thai Language Automatic Speech Recognition Interface for Community E-Commerce, C10F630122) and compute support from the Apex cluster team. Some evaluation data was donated by Wang.
# Citation
```
@inproceedings{suwanbandit23_interspeech,
author={Artit Suwanbandit and Burin Naowarat and Orathai Sangpetch and Ekapol Chuangsuwanich},
title={{Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={4069--4073},
doi={10.21437/Interspeech.2023-1828}
}
```
Provider:
CMKL



