
changelinglab/thchs30-segment

Source: Hugging Face. Updated 2026-04-12; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/changelinglab/thchs30-segment
Resource description:
---
license: apache-2.0
language:
- zh
pretty_name: THCHS-30 Segment
task_categories:
- automatic-speech-recognition
tags:
- speech
- phone-alignment
- segmentation
- mandarin
size_categories:
- 10K<n<100K
---

# THCHS-30 Segment

Mandarin Chinese read-speech corpus with **phone-level time alignments**. Suitable for training and evaluating phone recognition and phonetic segmentation models.

## Sources

- **Audio**: [THCHS-30](https://www.openslr.org/18/) (OpenSLR 18) by Dong Wang, Xuewei Zhang, Zhiyong Zhang (Tsinghua University, 2015).
- **Phone alignments**: [`anyspeech/THCHS-30-alignments`](https://huggingface.co/datasets/anyspeech/THCHS-30-alignments).

## Splits

| Split | Utterances |
|-------|------------|
| train | 10,000     |
| val   | 893        |
| test  | 2,495      |

Splits follow the original OpenSLR 18 directory partition (`data_thchs30/{train,dev,test}`); `dev` is renamed to `val`.

## Schema

| Column         | Type              | Description                                 |
|----------------|-------------------|---------------------------------------------|
| `utt_id`       | string            | Utterance id, e.g. `A11_0`                  |
| `audio`        | Audio(16 kHz)     | Embedded waveform bytes (decoded on access) |
| `text`         | string            | Hanzi sentence transcript                   |
| `phones`       | sequence[string]  | IPA phone tokens with tone diacritics       |
| `phone_starts` | sequence[float64] | Phone start times in seconds                |
| `phone_ends`   | sequence[float64] | Phone end times in seconds                  |
| `language`     | string            | `cmn` (ISO 639-3)                           |
| `speaker_id`   | string            | Speaker code (utt_id prefix, e.g. `A11`)    |
| `duration`     | float64           | Utterance duration in seconds               |
| `split`        | string            | `train` / `val` / `test`                    |

## Phone inventory

Phones are IPA with Mandarin tone diacritics, e.g. `lː`, `y˥˩`, `ʂ˘`, `ɻ̩˥˩`, `a˧˥˘`. Silence and pauses are marked with `[SIL]` intervals, which are kept in the alignment so boundary models can learn from them.

## License

Released under the **Apache 2.0** license, matching the original THCHS-30 release.

## Citation

```bibtex
@misc{THCHS30_2015,
  title={THCHS-30 : A Free Chinese Speech Corpus},
  author={Dong Wang and Xuewei Zhang and Zhiyong Zhang},
  year={2015},
  url={http://arxiv.org/abs/1512.01882}
}
```
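
The Schema section stores each alignment as three parallel columns (`phones`, `phone_starts`, `phone_ends`). The sketch below shows one way those columns might be zipped into intervals and how `[SIL]` segments could be filtered out. The helper names (`phone_intervals`, `speech_intervals`, `speech_duration`) and the `example` record are hypothetical and the timing values are made up for illustration; only the column names and the `[SIL]` label come from the card.

```python
from typing import Dict, List, Tuple

SIL = "[SIL]"  # silence/pause label named in the dataset card

Interval = Tuple[str, float, float]  # (phone, start_sec, end_sec)

# Hypothetical record with invented timings, shaped like the schema above.
# With the Hugging Face `datasets` library, a real record could presumably be
# obtained via load_dataset("changelinglab/thchs30-segment"); that call is an
# assumption and has not been verified against this repository.
example: Dict = {
    "utt_id": "A11_0",
    "phones": ["[SIL]", "a˧˥˘", "[SIL]"],
    "phone_starts": [0.0, 0.25, 0.75],
    "phone_ends": [0.25, 0.75, 1.0],
}


def phone_intervals(record: Dict) -> List[Interval]:
    """Zip the three parallel columns into (phone, start, end) tuples."""
    phones = record["phones"]
    starts = record["phone_starts"]
    ends = record["phone_ends"]
    if not (len(phones) == len(starts) == len(ends)):
        raise ValueError("phones, phone_starts and phone_ends differ in length")
    intervals = list(zip(phones, starts, ends))
    for phone, start, end in intervals:
        if end < start:
            raise ValueError(f"negative duration for phone {phone!r}")
    return intervals


def speech_intervals(record: Dict) -> List[Interval]:
    """Return the intervals with [SIL] segments removed."""
    return [iv for iv in phone_intervals(record) if iv[0] != SIL]


def speech_duration(record: Dict) -> float:
    """Sum the durations (in seconds) of all non-silence phones."""
    return sum(end - start for _, start, end in speech_intervals(record))
```

For boundary-detection models the `[SIL]` intervals would normally be kept, as the card notes; filtering them is only useful for statistics such as per-phone durations.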
Provided by:
changelinglab