WSJCAM0 Cambridge Read News

Name: WSJCAM0 Cambridge Read News
Creator: UC Berkeley Library Dataverse
Published: 2024-10-15 23:07:12
License: No description available

DataCite Commons; updated 2024-10-15; indexed 2025-04-16

Download link:

https://datasets.lib.berkeley.edu/citation?persistentId=doi:10.60503/D3/86HXXF

Description:

A British English speech corpus for large-vocabulary continuous speech recognition (the Cambridge University version of the ARPA CSR Corpus WSJ0). This release of WSJCAM0 represents version 1.1 of the corpus, which was initially released on tape by Cambridge University as of August 31, 1994. The collection was modelled directly on the ARPA CSR Corpus released by LDC in 1993: it used the same dual-microphone recording paradigm and a subset of prompting texts drawn from the Wall Street Journal. There are two key differences between WSJ0 and WSJCAM0: (1) the subjects in WSJCAM0 were native speakers of British English, and (2) in addition to standard orthographic transcripts, WSJCAM0 also has information on the time alignment between the sampled waveform and both the words and the phonetic segments.

The contents of the publication consist of the following:

1. Training data from head-mounted microphone
2. Development test data from head-mounted microphone, plus first set of evaluation test data
3. Training data from desk-mounted microphone
4. Development test data from desk-mounted microphone, plus second set of evaluation test data

There are 90 utterances from each of 92 speakers that are designated as training material for speech recognition algorithms. An additional 48 speakers each read 40 sentences containing only words from a fixed 5,000-word vocabulary and another 40 sentences using a 64,000-word vocabulary, to be used as testing material. Each of the total of 140 speakers also recorded a common set of 18 adaptation sentences. Recordings were made from two microphones: a far-field desk microphone and a head-mounted close-talking microphone.

Within the train and test sets, speech data are organized by speaker; prompting texts, detailed transcriptions, and speaker information are included in each speaker directory. All waveform files have NIST SPHERE headers. Waveform data are compressed using the Shorten algorithm developed by Tony Robinson at Cambridge University, as adapted for use in the NIST SPHERE software package.
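Because every waveform file carries a NIST SPHERE header, the header can be inspected before any audio is decoded. The following is a minimal sketch, assuming the usual SPHERE layout: a `NIST_1A` magic line, a line giving the header size in bytes, then `name type value` fields terminated by `end_head`. The function name `parse_sphere_header` and the sample field values are illustrative and are not taken from the corpus. The sketch reads the header only; it does not decode Shorten-compressed samples.

```python
def parse_sphere_header(data: bytes):
    """Parse the plain-text header at the start of a NIST SPHERE file.

    Returns (header_size_in_bytes, fields_dict).
    """
    parts = data.split(b"\n", 2)
    if len(parts) < 3 or parts[0] != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(parts[1].strip())

    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    # Skip the magic line and the size line, then read fields until end_head.
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # blank line or comment
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            continue  # malformed field; ignored in this sketch
        name, type_tag, value = tokens
        if type_tag == "-i":
            fields[name] = int(value)
        elif type_tag == "-r":
            fields[name] = float(value)
        elif type_tag.startswith("-s"):
            fields[name] = value  # string; the tag also encodes its length
    return header_size, fields


# Synthetic header for illustration only (values are made up).
sample = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

size, fields = parse_sphere_header(sample)
print(size, fields)
```

For a real file, the same function can be applied to the first bytes read from disk, for example `parse_sphere_header(open(path, "rb").read(1024))`, where `path` is a placeholder for a waveform file in a speaker directory.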

Provider:

UC Berkeley Library Dataverse

Created:

2024-10-15
