CSLU: Names Release 1.3

Source: DataCite Commons. Updated 2021-07-01. Indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2006S39
Description:
A common problem in training and developing speech recognition systems is the scarcity of data, especially for particular phonemic contexts. The Center for Spoken Language Understanding (CSLU) is attempting to address this problem with the Names Corpus, a collection of name utterances, both first and last names, recorded over the telephone from several thousand different speakers. The utterances are "spontaneous" in that the subject is not reading from a word list.

Another area of active research is the development of name recognition systems. The Names Corpus is a useful resource for addressing this problem as well.

The utterances in this corpus were taken from many other telephone speech data collections completed at the CSLU. In most of those collections, callers were asked to leave their name at some point. Callers would also occasionally give their name in the midst of another utterance; in those cases the name was extracted from the host utterance and added to the Names Corpus.

Each file in the Names Corpus has an orthographic transcription following the CSLU Labeling Conventions. To take advantage of the phonemic variability, many of the utterances have also been phonetically transcribed. Files were selected for phonetic transcription by a process that favored those suspected to contain phonetic contexts that had not yet been transcribed.

Release 1.3 of this corpus contains 24,245 files, all of which have been phonetically labeled. Approximately 40% of the possible bigram phonemic contexts, counted without regard to language constraints, are represented.

Samples

For an example of the data in this publication, please review the audio sample (./desc/addenda/LDC2006S39.wav) and its transcription (./desc/addenda/LDC2006S39.txt).

Portions © 2001, 2003 Speech Technology Center Ltd., © 2006 Trustees of the University of Pennsylvania
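The bigram coverage figure quoted in the description can be read as the share of all ordered phoneme pairs that occur at least once in the phonetic labels. The sketch below illustrates one way such a figure might be computed. It is not the CSLU procedure: the function name `bigram_coverage`, the toy phoneme inventory, and the toy transcriptions are all hypothetical, and "without regard to language constraints" is assumed to mean that every ordered pair in the inventory counts as possible.

```python
def bigram_coverage(transcriptions, inventory):
    """Return the fraction of all ordered phoneme pairs seen at least once.

    Assumption: the denominator is |inventory|^2, i.e. every ordered pair is
    treated as possible, ignoring phonotactic (language) constraints.
    """
    symbols = set(inventory)
    if not symbols:
        raise ValueError("inventory must not be empty")

    seen = set()
    for phones in transcriptions:
        # Pair each phoneme with its immediate successor in the utterance.
        for left, right in zip(phones, phones[1:]):
            if left in symbols and right in symbols:
                seen.add((left, right))

    return len(seen) / (len(symbols) ** 2)


# Hypothetical toy data, not taken from the corpus.
inventory = ["k", "ae", "t", "s"]
transcriptions = [
    ["k", "ae", "t"],
    ["k", "ae", "t", "s"],
    ["s", "ae", "t"],
]

coverage = bigram_coverage(transcriptions, inventory)
print(f"bigram coverage: {coverage:.2%}")
```

In this toy case four distinct pairs are observed out of sixteen possible, so the coverage is 25%. Restricting the denominator to pairs that a language actually permits would give a higher figure for the same data.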
Provider:
Linguistic Data Consortium
Created:
2020-11-30