
HUB5 Mandarin Telephone Speech Corpus

Source: DataCite Commons. Updated: 2021-07-01. Indexed: 2024-07-13.
Download link:
https://catalog.ldc.upenn.edu/LDC98S69
Description:
<p>LDC98S69 - Speech data <a href="http://catalog.ldc.upenn.edu/LDC98T26" rel="nofollow">LDC98T26</a> - Transcripts</p><br>
<h3>Introduction</h3><br>
<p>This release of HUB5 Mandarin training data consists of 42 calls derived from the CALLFRIEND Mandarin Chinese Mainland Dialect (Language ID) collection. The transcribed data is intended as additional training data in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), also sponsored by the U.S. Department of Defense. The transcripts cover a contiguous 5- to 30-minute segment taken from a recorded conversation lasting up to 30 minutes.</p><br>
<p>LDC has released HUB5 Mandarin Telephone Speech and Transcripts Second Edition (<a href="../../../LDC2018S18">LDC2018S18</a>), which combines the speech and transcripts and makes some updates to the release. See the catalog entry for more details.</p><br>
<h3>Data</h3><br>
<p>Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements) and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to be recorded. Each caller was allowed to talk for up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call. Callers were given no guidelines concerning what they should talk about. Once recruited to participate, a caller was given a free choice of whom to call. Most participants called family members or close friends. All calls originated in North America and were placed to various locations within North America.</p><br>
<h3>Updates</h3><br>
<p>There are no updates at this time.</p><br>
Portions © 1998 Trustees of the University of Pennsylvania

Provider:
Linguistic Data Consortium
Created:
2020-11-30