Mandarin Chinese Phonetic Segmentation and Tone

Name: Mandarin Chinese Phonetic Segmentation and Tone
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:27:10
License: No description available

DataCite Commons: updated 2021-07-01, indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2015S05

Resource description:

<h3>Introduction</h3><br> <p>Mandarin Chinese Phonetic Segmentation and Tone was developed by the Linguistic Data Consortium (LDC) and contains 7,849 Mandarin Chinese "utterances" with their phonetic segmentation and tone labels, separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (<a href="../../../LDC98S73">LDC98S73</a> and <a href="../../../LDC98T24">LDC98T24</a>, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA.</p><br> <p>The ability to use large speech corpora for research in phonetics, sociolinguistics and psychology, among other fields, depends on the availability of phonetic segmentation and transcriptions. This corpus was developed to investigate the use of phone boundary models on forced alignment in Mandarin Chinese. Using the approach of embedded tone modeling (also used for incorporating tones for automatic speech recognition), forced-alignment performance was compared between tone-dependent and tone-independent models.</p><br> <h3>Data</h3><br> <p>Utterances were defined as the time-stamped between-pause units in the transcribed news recordings. Those with background noise, music, unidentified speakers and accented speakers were excluded. A test set was developed with 300 utterances randomly selected from six speakers (50 utterances for each speaker). The remaining 7,549 utterances formed a training set.</p><br> <p>The utterances in the test set were manually labeled and segmented into initials and finals in Pinyin, a Roman alphabet system for transcribing Chinese characters. Tones were marked on the finals, including Tone1 through Tone4, and Tone0 for the neutral tone. Tone3 syllables changed by tone sandhi were labeled as Tone2. 
The training set was automatically segmented and transcribed using the LDC forced aligner, which is a Hidden Markov Model (HMM) aligner trained on the same utterances (Yuan et al. 2014). The aligner achieved 93.1% agreement (of phone boundaries) within 20 ms on the test set compared to manual segmentation. The quality of the phonetic transcription and tone labels of the training set was evaluated by checking 100 utterances randomly selected from it. The 100 utterances contained 1,252 syllables: 15 syllables had incorrect tone transcriptions, two syllables had incorrect transcriptions of the final, and no syllables had transcription errors on the initial.</p><br> <p>Each utterance has three associated files: a FLAC-compressed WAV file, a word transcript file, and a phonetic boundaries and label file.</p><br> <h3>Samples</h3><br> <p>Please see this <a href="desc/addenda/LDC2015S05.wav">audio sample</a>, <a href="desc/addenda/LDC2015S05.txt">transcript sample</a> and <a href="desc/addenda/LDC2015S05.phons.txt">phonetic labels sample</a>.</p><br> <h3>Acknowledgement</h3><br> <p>This work was supported in part by National Science Foundation Grant No. IIS-0964556.</p><br> <h3>Updates</h3><br> <p>None at this time.</p><br> <h3>Additional Licensing Instructions</h3><br> <p>This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact <a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> for information about becoming a member.</p><br> Portions © 1997 China Central TV, © 1997 MultiCultural Broadcasting Corporation, © 1997, 1998, 2007, 2015 Trustees of the University of Pennsylvania
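
The "agreement within 20 ms" figure quoted in the Data section can be illustrated with a minimal sketch. It is not the LDC evaluation script: the function name `boundary_agreement`, the millisecond inputs, and the one-to-one pairing of boundaries (which assumes the manual and automatic segmentations contain the same phone sequence) are all assumptions made for illustration.

```python
def boundary_agreement(manual_ms, predicted_ms, tolerance_ms=20):
    """Return the fraction of boundaries whose predicted time lies
    within tolerance_ms of the manually labeled time.

    Hypothetical helper: assumes both lists hold the same number of
    boundaries, in the same order, with times in milliseconds.
    """
    if len(manual_ms) != len(predicted_ms):
        raise ValueError("boundary lists must have the same length")
    if not manual_ms:
        raise ValueError("no boundaries to compare")
    # Count boundary pairs that differ by at most the tolerance.
    hits = sum(
        1 for m, p in zip(manual_ms, predicted_ms)
        if abs(m - p) <= tolerance_ms
    )
    return hits / len(manual_ms)


# Made-up boundary times (ms), not taken from the corpus.
manual = [120, 250, 410, 600]
predicted = [125, 275, 405, 615]
print(boundary_agreement(manual, predicted))  # 3 of 4 within 20 ms -> 0.75
```

The same kind of ratio applies to the training-set spot check described above: 15 tone errors in 1,252 syllables is roughly 1.2%.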

Provider:

Linguistic Data Consortium

Created:

2020-11-30
