
CSR-II (WSJ1) Other

Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC94S13C
Description:
LDC94S13A (http://catalog.ldc.upenn.edu/LDC94S13A) - Complete CSR-II corpus
LDC94S13B (http://catalog.ldc.upenn.edu/LDC94S13B) - CSR-II Sennheiser speech
LDC94S13C - CSR-II Other speech

Data

The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 "conventional" development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the total amount of speech in the corpus is about 162 hours (roughly 81 hours per microphone).

In early 1993, a "Hub and Spoke" test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech).

WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression algorithm developed at Cambridge University.

Updates

The CD-ROM labeled "Evaluation Test Data, Part 1" (NIST Speech Disk 13-32.1) contains the file wsj1/doc/lng_modl/base_lm/tcb20onp.z ("WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z" on a Windows OS). Although this file has the ".z" extension, it is not a compressed file. To use it, simply ignore the ".z" extension.
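The note about tcb20onp.z can be handled with a small helper. What follows is only a minimal Python sketch, not an LDC or NIST tool: the function names `detect_compression` and `copy_without_z` are hypothetical, and the magic-number check is an added sanity guard that the catalog entry does not describe. It copies a file whose name ends in ".z" to a new name without the extension, and refuses if the first two bytes look like a real compress, pack, or gzip stream.

```python
import shutil
from pathlib import Path

# First two bytes of common Unix compressed formats (assumption: these three
# cover the formats a ".z"/".Z" name would normally imply).
COMPRESSED_MAGIC = {
    b"\x1f\x9d": "compress",
    b"\x1f\x1e": "pack",
    b"\x1f\x8b": "gzip",
}


def detect_compression(path):
    """Return the format name if the file starts with a known magic number, else None."""
    with open(path, "rb") as f:
        head = f.read(2)
    return COMPRESSED_MAGIC.get(head)


def copy_without_z(src, dest_dir):
    """Copy an uncompressed '*.z' file into dest_dir with the extension dropped."""
    src = Path(src)
    if src.suffix.lower() != ".z":
        raise ValueError(f"{src.name} does not end in .z")
    fmt = detect_compression(src)
    if fmt is not None:
        # The file really is compressed; renaming alone would not be enough.
        raise ValueError(f"{src.name} looks like a {fmt} stream")
    dest = Path(dest_dir) / src.stem  # e.g. tcb20onp.z -> tcb20onp
    shutil.copyfile(src, dest)
    return dest
```

A call such as `copy_without_z("wsj1/doc/lng_modl/base_lm/tcb20onp.z", "work")` would write `work/tcb20onp`, assuming the directory `work` already exists. Copying rather than renaming leaves the read-only disc contents untouched.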
Provider:
Linguistic Data Consortium
Created:
2020-11-30