Continuous Speech Recognition Corpus - Disc 1 of 1

DataONE, updated 2018-02-12, indexed 2024-06-25

Download link:

https://search.dataone.org/view/sha256:e40df56e316572ce26911ae32557d7bca4d0eab5c5ca9abc9c64eefe4f4db931

Description:

The third ARPA Continuous Speech Recognition (CSR) Benchmark Speech Test Collection is a three-CD-ROM set that contains complete development test and evaluation test suites for speaker-independent, large-vocabulary speech recognition systems. The development and evaluation tests share a common structure, consisting of two core test components ("hubs") and seven specialized test components ("spokes"). The hub tests, which were mandatory for all ARPA CSR participants in the November '94 evaluations, provide a baseline for ASR performance, while the spokes provide the means for assessing the impact of particular speaking conditions or processing strategies relative to that baseline. Participants were free to take any combination of spoke tests according to their research interests. Taken together, the collection encompasses 180 speakers, each producing 20-40 sentences. These are organized into two complete development test sets and one evaluation set. The collection also includes complete documentation on the test specifications, data collection procedures, transcriptions and scoring protocols, together with the latest available version of NIST software for scoring ASR results and managing SPHERE waveform files. All speech data is accompanied by both the prompting texts and the detailed orthographic transcriptions of the utterances. This was the first ARPA CSR Benchmark Test in which prompting texts were drawn from a variety of news sources. Whereas earlier benchmarks were based on Wall Street Journal excerpts (from the period 1987-89), CSR-III prompts come from a variety of North American business news services: Reuters News Service, New York Times, Washington Post and Los Angeles Times as well as WSJ; all texts are drawn from financial news articles written during the period of April through June, 1994. (NAB stands for "North American Business," in contrast to earlier benchmarks and training collections labeled "WSJ".)
An important companion to the 1994 Benchmark Speech data collection is the four-disk CSR-III Text Collection (LDC95T6), which includes the ARPA CSR 1994 Standard Language Model. This corpus is also available from the LDC as a 1995 release. Because of restrictions imposed by the copyright holders of much of the NAB text, both the speech and text collections are available to LDC members only. For more information on how to join, send email to ldc@ldc.upenn.edu.

Created:

2023-11-21
