
WSJCAM0 Cambridge Read News

Source: Mendeley Data. Updated: 2024-01-31. Indexed: 2024-06-28.
Download link:
https://catalog.ldc.upenn.edu/LDC95S24
Resource description:
A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition (The Cambridge University Version of the ARPA CSR Corpus WSJ0). This release of WSJCAM0 represents version 1.1 of the corpus, which was initially released on tape by Cambridge University as of August 31, 1994. The collection was modelled directly on the ARPA CSR Corpus released by LDC in 1993: it used the same dual-microphone recording paradigm and a subset of the prompting texts drawn from the Wall Street Journal. There are two key differences between WSJ0 and WSJCAM0: (1) the subjects in WSJCAM0 were native speakers of British English, and (2) in addition to standard orthographic transcripts, WSJCAM0 also has information on the time alignment between the sampled waveform and both the words and the phonetic segments.

The contents of the publication consist of the following: (1) training data from the head-mounted microphone; (2) development test data from the head-mounted microphone, plus the first set of evaluation test data; (3) training data from the desk-mounted microphone; (4) development test data from the desk-mounted microphone, plus the second set of evaluation test data.

There are 90 utterances from each of 92 speakers that are designated as training material for speech recognition algorithms. An additional 48 speakers each read 40 sentences containing only words from a fixed 5,000-word vocabulary and another 40 sentences using a 64,000-word vocabulary, to be used as testing material. Each of the 140 speakers in total also recorded a common set of 18 adaptation sentences. Recordings were made from two microphones: a far-field desk microphone and a head-mounted close-talking microphone. Within the train and test sets, speech data are organized by speaker; prompting texts, detailed transcriptions, and speaker information are included in each speaker directory. All waveform files have NIST SPHERE headers.
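Since every waveform file carries a NIST SPHERE header, a short sketch of reading one may be useful when inspecting the files. The Python below is a minimal illustration, not part of the corpus tooling. It assumes the common SPHERE layout: an ASCII header that starts with `NIST_1A`, a second line giving the header size in bytes, then `name type value` fields terminated by `end_head`. The function names and the example header values (sample rate, sample count, coding string) are made up for the demonstration and are not taken from the corpus documentation.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file (sketch)."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("missing NIST_1A magic line; not a SPHERE file")
    # The second line is the total header size in bytes, e.g. "   1024".
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[16:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break  # everything after this marker is padding
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # "-sN": string field of N bytes
            fields[name] = value
    return fields


def looks_shorten_compressed(fields: dict) -> bool:
    """Heuristic: check whether the sample_coding field mentions shorten."""
    return "shorten" in str(fields.get("sample_coding", "")).lower()


def build_example_header() -> bytes:
    """Build a synthetic 1024-byte header; the values are illustrative only."""
    lines = [
        "NIST_1A",
        "   1024",
        "channel_count -i 1",
        "sample_rate -i 16000",
        "sample_n_bytes -i 2",
        "sample_count -i 48000",
        "sample_coding -s26 pcm,embedded-shorten-v1.09",
        "end_head",
    ]
    raw = ("\n".join(lines) + "\n").encode("ascii")
    return raw.ljust(1024, b" ")


if __name__ == "__main__":
    header = parse_sphere_header(build_example_header())
    seconds = header["sample_count"] / header["sample_rate"]
    print(header)
    print("duration (s):", seconds)
    print("shorten-compressed:", looks_shorten_compressed(header))
```

For a real file, the first 1024 bytes read in binary mode can be passed to `parse_sphere_header`. The sketch reads metadata only: the sample data after the header still has to be decompressed with a SPHERE-aware tool before it can be used as plain PCM.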
Waveform data are compressed using the Shorten algorithm developed by Tony Robinson at Cambridge University, as adapted for use in the NIST SPHERE software package.

Samples: Head Mounted Mic, Desk Mounted Mic, Phoneme Alignments, Word Alignments.

Updates: On October 1, 2015, the corpus was modified to be released as a web download. Documentation was modified to reflect this.

Portions © 1987-1989 Dow Jones & Company, Inc., © 1995 Trustees of the University of Pennsylvania
Created:
2024-01-31