HKUST Mandarin Telephone Speech, Part 1
Source: Mendeley Data. Updated 2024-01-31; indexed 2024-06-27.
Download link:
https://catalog.ldc.upenn.edu/LDC2005S15
Resource description:
Introduction

HKUST Mandarin Telephone Speech, Part 1 was developed by Hong Kong University of Science and Technology (HKUST). In 2004, HKUST was contracted to collect and transcribe 200 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in mainland China under the DARPA EARS (Effective, Affordable, Reusable Speech-to-Text) framework. The first 50 hours of speech and transcripts were released in June 2004 to the EARS community for the RT-04 NIST evaluation. NIST partitioned the remaining 150 hours of the collection into training, development and evaluation sets. This release contains the training and development sets, with 873 and 24 calls, respectively.

Data Collection

Subjects were recruited in several cities across mainland China. Most subjects did not previously know each other. To encourage more meaningful conversation, topics similar to those in Fisher English were designed. All calls were operator-assisted: an operator called the two participants as scheduled to initiate the call. Subjects were asked demographic questions before they were bridged for normal conversation, and their answers were recorded in separate files.

Subjects were allowed to talk for up to 10 minutes. With a few exceptions, calls are of the maximum length. Although subjects were allowed to make up to three calls, every subject in this release made just one call, with one exception: PIN 10683 and PIN 10686 belong to a single individual.

Each side of a call was recorded in a separate .wav file as 8-bit A-law encoded samples at 8 kHz. The two sides were later multiplexed into SPHERE format with the A-law encoding preserved. Where one side was shorter than the other, the shorter side was padded with silence. In the release, the file name of each recorded call has the format date_time_Apin_Bpin.sph, and the corresponding transcript has the same name with a .txt extension.
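The naming convention described above can be illustrated with a minimal Python sketch. It assumes that "A" and "B" are literal prefixes in front of each speaker's PIN and that the date and time fields contain no underscores. Their exact formats are not given in the description, so they are kept as opaque strings. The helper names (parse_call_name, transcript_for) and the sample file name are hypothetical and are not part of the corpus.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

# Assumed layout: <date>_<time>_A<pin>_B<pin>, with "A"/"B" as literal prefixes.
CALL_NAME = re.compile(
    r"(?P<date>[^_]+)_(?P<time>[^_]+)_A(?P<a_pin>[^_]+)_B(?P<b_pin>[^_]+)"
)


class CallName(NamedTuple):
    date: str   # kept as text; the exact date format is not specified
    time: str   # kept as text; the exact time format is not specified
    a_pin: str  # PIN of the speaker on side A
    b_pin: str  # PIN of the speaker on side B


def parse_call_name(filename: str) -> Optional[CallName]:
    """Split a call's audio or transcript file name into its four fields."""
    path = Path(filename)
    if path.suffix not in {".sph", ".txt"}:
        return None
    match = CALL_NAME.fullmatch(path.stem)
    if match is None:
        return None
    return CallName(**match.groupdict())


def transcript_for(audio_filename: str) -> str:
    """Return the transcript name: same stem, .txt extension."""
    return str(Path(audio_filename).with_suffix(".txt"))


if __name__ == "__main__":
    name = "20040101_120000_A000001_B000002.sph"  # made-up example name
    print(parse_call_name(name))
    print(transcript_for(name))
```

Keeping the PINs as strings rather than integers preserves any leading zeros, which matters if they are later matched against the demographics table.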
Speaker demographics

Subjects were asked to provide several pieces of demographic information, including gender, age, native language/dialect, birthplace, education, occupation and phone type. Standard Mandarin is not the native dialect in many regions of China, but it is the official language of education, and speakers may or may not have a regional accent when speaking it. For this reason, subjects' birthplaces were divided into Mandarin-dominant and non-Mandarin-dominant regions, and all calls were audited and classified as either standard or accented, without further distinctions. Selected demographics (age, gender, birthplace, phone type and accent for each side of the call, plus the topic ID for the call) are provided as a tab-delimited, plain-text, tabular file.

Samples

To review an example of this corpus, see the wav or mp3 audio samples linked from the catalog page.

© 2005 Trustees of the University of Pennsylvania
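As a rough illustration of working with the tab-delimited demographics file described under "Speaker demographics", the Python sketch below reads such a table and tallies one field. The column names and their order are assumptions made for this example only. The description lists which fields are present but not how the file is laid out, so the real layout should be checked against the corpus documentation. The sample rows are made up.

```python
import csv
import io
from collections import Counter

# Hypothetical column order; the description does not specify the real layout.
ASSUMED_COLUMNS = (
    "call_id", "side", "age", "gender",
    "birthplace", "phone_type", "accent", "topic_id",
)


def read_demographics(handle, columns=ASSUMED_COLUMNS):
    """Read tab-delimited rows into dicts keyed by the assumed column names."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(columns):
            continue  # skip blank lines and rows that do not fit the layout
        rows.append(dict(zip(columns, fields)))
    return rows


def count_by(rows, key):
    """Count how many call sides share each value of the given field."""
    return Counter(row[key] for row in rows)


if __name__ == "__main__":
    # Made-up sample data standing in for the real file.
    sample = (
        "c1\tA\t25\tF\tmandarin\tfixed\tstandard\t7\n"
        "c1\tB\t31\tM\tnon-mandarin\tmobile\taccented\t7\n"
    )
    rows = read_demographics(io.StringIO(sample))
    print(count_by(rows, "accent"))
```

For the real file, pass an open text handle, for example `open(path, newline="", encoding="utf-8")`, where the path and encoding are again assumptions.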
Created:
2024-01-31
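Collected summary
Dataset introduction

Background and challenges
Background overview
This dataset is a Mandarin telephone speech corpus developed by HKUST. It contains about 149 hours of conversational speech, sampled at 8 kHz with A-law encoding, and is intended mainly for automatic content extraction research. The data were collected from Mandarin speakers in mainland China and are split into a training set and a development set. Each call side is annotated with the speaker's birthplace region and a standard or accented label, which makes the corpus suitable for speech recognition and dialect analysis tasks.
The content above was collected and summarized by 遇见数据集.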



