
1997 Spanish Broadcast News Speech (HUB4-NE)

Source: DataCite Commons | Updated: 2021-07-01 | Indexed: 2024-07-13
Download link:
https://catalog.ldc.upenn.edu/LDC98S74
Description:
<p>LDC98S74 - Speech data <a href="http://catalog.ldc.upenn.edu/LDC98T29" rel="nofollow">LDC98T29</a> - Transcripts</p><br> <h3>Introduction</h3><br> <p>This corpus contains a portion of the acoustic data designated as the training set for the 1997 DARPA HUB4 Spanish Benchmark. It contains speech and transcripts of 30 hours of broadcast news from the following sources: Televisa, Univision and VOA.</p><br> <h3>Data</h3><br> <p>All acoustic files are in NIST SPHERE format, without compression. The sample data are 16-bit linear PCM, sampled at 16 kHz, single channel. Most files contain 30 minutes of recorded material and some contain approximately 60 or 120 minutes. The sampling format requires roughly two megabytes (MB) per minute of recording, so file sizes are typically around 60 MB, with some files ranging up to 120 or 240 MB.</p><br> <p>The transcripts are in SGML format, using the same markup conventions that have been applied to the other 1997 Broadcast News speech corpora (in English and Mandarin). They are distributed by FTP rather than on the CD-ROMs that carry the speech data.</p><br> <h3>Updates</h3><br> <p>There are no updates at this time.</p><br> <h3>Additional Licensing Instructions</h3><br> <p>This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact&nbsp;<a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>&nbsp;for information about becoming a member.</p></br> Portions © 1997 Televisa S.A. de C.V., © 1997 Univision Network Limited Partnership, © 1997, 1998 Trustees of the University of Pennsylvania
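The storage figures quoted in the description can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the 1024-byte header is an assumption based on the usual NIST SPHERE header size, not a value stated in the catalog entry.

```python
# Rough storage estimate for uncompressed 16-bit, 16 kHz, mono PCM audio.
# The function names and the header size are illustrative assumptions.

SPHERE_HEADER_BYTES = 1024  # assumption: typical NIST SPHERE header size


def pcm_bytes_per_minute(sample_rate_hz: int = 16_000,
                         bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Bytes needed to store one minute of linear PCM audio."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60


def estimated_file_bytes(minutes: float,
                         header_bytes: int = SPHERE_HEADER_BYTES) -> int:
    """Approximate size of one uncompressed file of the given duration."""
    return header_bytes + round(minutes * pcm_bytes_per_minute())


if __name__ == "__main__":
    per_minute = pcm_bytes_per_minute()
    print(f"per minute: {per_minute / 1e6:.2f} MB")
    for minutes in (30, 60, 120):
        size = estimated_file_bytes(minutes)
        print(f"{minutes:>3} min: about {size / 1e6:.1f} MB")
```

Under these assumptions one minute of audio takes 1.92 MB, so a 30-minute file comes to about 57.6 MB. That is in line with the rounded figures of "roughly two megabytes per minute" and "around 60 MB" given in the description.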

Provider:
Linguistic Data Consortium
Created:
2020-11-30