CSR-I (WSJ0) Complete

Name: CSR-I (WSJ0) Complete
Creator: UC Berkeley Library Dataverse
Published: 2025-01-28 17:27:18
License: No description available

DataCite Commons: updated 2025-01-28; indexed 2025-04-16

Download link:

https://datasets.lib.berkeley.edu/citation?persistentId=doi:10.60503/D3/HW5W1B

Description:

CSR-I (WSJ0) Complete was developed by NIST and contains approximately 141 hours of speech recordings of 123 speakers reading excerpts from the Wall Street Journal. About half the speakers are male and half female. Additionally, the discs contain complete orthographic transcriptions of the speech data and complete bigram language models for the Wall Street Journal text data from which the prompting text was taken. These materials are provided on disc 11-4 and also on disc 11-10. Disc 11-4 also contains the complete text of the WSJ articles from which the utterance prompts and language models were derived.

During 1991, the DARPA Spoken Language Program initiated efforts to build a new corpus to support research on large-vocabulary Continuous Speech Recognition (CSR) systems. The first two CSR Corpora consist primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal news text and are thus often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however, will consist of read texts from other sources of North American business news and eventually from other news domains.)

The texts to be read were selected to fall within either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation for details.) Some spontaneous dictation is included in addition to the read speech. The dictation portion was collected using journalists who dictated hypothetical news articles.

Provider:

UC Berkeley Library Dataverse

Created:

2025-01-28

Collected Summary

Dataset Introduction