changelinglab/thchs30-segment

Name: changelinglab/thchs30-segment
Creator: changelinglab
Published: 2026-04-12 16:49:39
License: No description provided

Source: Hugging Face. Updated 2026-04-12. Indexed 2026-05-10.

Download link:

https://hf-mirror.com/datasets/changelinglab/thchs30-segment


Description:

---
license: apache-2.0
language:
- zh
pretty_name: THCHS-30 Segment
task_categories:
- automatic-speech-recognition
tags:
- speech
- phone-alignment
- segmentation
- mandarin
size_categories:
- 10K<n<100K
---

# THCHS-30 Segment

Mandarin Chinese read-speech corpus with **phone-level time alignments**. Suitable for training and evaluating phone recognition and phonetic segmentation models.

## Sources

- **Audio**: [THCHS-30](https://www.openslr.org/18/) (OpenSLR 18) by Dong Wang, Xuewei Zhang, Zhiyong Zhang (Tsinghua University, 2015).
- **Phone alignments**: [`anyspeech/THCHS-30-alignments`](https://huggingface.co/datasets/anyspeech/THCHS-30-alignments).

## Splits

| Split | Utterances |
|-------|------------|
| train | 10,000     |
| val   | 893        |
| test  | 2,495      |

Splits follow the original OpenSLR 18 directory partition (`data_thchs30/{train,dev,test}`); `dev` is renamed to `val`.

## Schema

| Column         | Type              | Description                                 |
|----------------|-------------------|---------------------------------------------|
| `utt_id`       | string            | Utterance id, e.g. `A11_0`                  |
| `audio`        | Audio(16 kHz)     | Embedded waveform bytes (decoded on access) |
| `text`         | string            | Hanzi sentence transcript                   |
| `phones`       | sequence[string]  | IPA phone tokens with tone diacritics       |
| `phone_starts` | sequence[float64] | Phone start times in seconds                |
| `phone_ends`   | sequence[float64] | Phone end times in seconds                  |
| `language`     | string            | `cmn` (ISO 639-3)                           |
| `speaker_id`   | string            | Speaker code (utt_id prefix, e.g. `A11`)    |
| `duration`     | float64           | Utterance duration in seconds               |
| `split`        | string            | `train` / `val` / `test`                    |

## Phone inventory

Phones are IPA with Mandarin tone diacritics, e.g. `lː`, `y˥˩`, `ʂ˘`, `ɻ̩˥˩`, `a˧˥˘`. Silence and pauses are marked with `[SIL]` intervals, which are kept in the alignment so boundary models can learn from them.

## License

Released under the **Apache 2.0** license, matching the original THCHS-30 release.

## Citation

```bibtex
@misc{THCHS30_2015,
  title={THCHS-30 : A Free Chinese Speech Corpus},
  author={Dong Wang and Xuewei Zhang and Zhiyong Zhang},
  year={2015},
  url={http://arxiv.org/abs/1512.01882}
}
```
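
## Example: working with the alignments

The `phones`, `phone_starts`, and `phone_ends` columns described in the schema are parallel sequences, so a common first step is to zip them into segments and, for frame-based models, to expand them into one label per frame. The following is a minimal sketch under stated assumptions, not an official loader: the helper names `phone_segments` and `frame_labels`, the 10 ms hop, and the sample record are all hypothetical. In practice a record would presumably come from something like `datasets.load_dataset("changelinglab/thchs30-segment", split="val")`, assuming the repository loads with its default configuration.

```python
from typing import Dict, List, Tuple

SIL = "[SIL]"  # silence/pause marker named in the dataset card

# Hypothetical record shaped like the schema (values are invented for
# illustration and are not taken from the dataset).
example = {
    "utt_id": "A11_0",
    "phones": [SIL, "ʂ˘", "y˥˩", SIL],
    "phone_starts": [0.00, 0.10, 0.25, 0.40],
    "phone_ends": [0.10, 0.25, 0.40, 0.50],
    "duration": 0.50,
}


def phone_segments(
    record: Dict, keep_silence: bool = False
) -> List[Tuple[str, float, float]]:
    """Zip the parallel alignment columns into (phone, start, end) tuples."""
    phones = record["phones"]
    starts = record["phone_starts"]
    ends = record["phone_ends"]
    if not (len(phones) == len(starts) == len(ends)):
        raise ValueError("alignment columns must have equal length")
    return [
        (p, s, e)
        for p, s, e in zip(phones, starts, ends)
        if keep_silence or p != SIL
    ]


def frame_labels(record: Dict, hop_s: float = 0.01) -> List[str]:
    """Expand the alignment into one phone label per frame of hop_s seconds.

    Frames not covered by any interval default to the silence marker.
    """
    n_frames = int(round(record["duration"] / hop_s))
    labels = [SIL] * n_frames
    for phone, start, end in phone_segments(record, keep_silence=True):
        first = int(round(start / hop_s))
        last = min(int(round(end / hop_s)), n_frames)
        for i in range(first, last):
            labels[i] = phone
    return labels


segments = phone_segments(example)   # silence intervals dropped
labels = frame_labels(example)       # one label per 10 ms frame
```

Keeping `[SIL]` in the frame labels mirrors the card's note that silence intervals are retained for boundary models; pass `keep_silence=False` only when the downstream task needs speech phones alone.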

Provider:

changelinglab
