CSR-II (WSJ1) Complete

Mendeley Data | Updated 2024-01-31 | Indexed 2024-06-28

Download link:

https://catalog.ldc.upenn.edu/LDC94S13A

Resource description:

LDC94S13A - Complete CSR-II corpus
LDC94S13B - CSR-II Sennheiser speech
LDC94S13C - CSR-II Other speech

Data

The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 "conventional" development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the amount of speech in the entire corpus is about 162 hours.

In early 1993, a "Hub and Spoke" test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech).

WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression algorithm developed at Cambridge University.

Updates

The cdrom labeled "Evaluation Test Data, Part 1" (NIST Speech Disk 13-32.1) contains the file wsj1/doc/lng_modl/base_lm/tcb20onp.z ("WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z" on a Windows OS). Please note that even though this file has the ".z" extension, it is not a compressed file. In order to use the file, simply ignore the ".z" extension.
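The note about the misleading ".z" extension suggests checking a file's leading bytes instead of trusting its name. The following is a minimal sketch, not part of the corpus tooling: the function names `sniff_bytes` and `sniff_file` are hypothetical, and the table only covers the widely documented two-byte magic numbers for Unix `compress`, `pack`, and `gzip`.

```python
# Two-byte magic numbers of common Unix compression formats.
MAGIC = {
    b"\x1f\x9d": "compress (.Z)",
    b"\x1f\x1e": "pack (.z)",
    b"\x1f\x8b": "gzip (.gz)",
}


def sniff_bytes(head: bytes):
    """Return a format name if `head` starts with a known magic number, else None."""
    return MAGIC.get(head[:2])


def sniff_file(path):
    """Read the first two bytes of `path` and classify them (hypothetical helper)."""
    with open(path, "rb") as fh:
        return sniff_bytes(fh.read(2))
```

A result of `None` from `sniff_file` on a local copy of `tcb20onp.z` would be consistent with the note above, in which case the file can be read directly without a decompression step. The location of that copy depends on where the disc is mounted, so no path is assumed here.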

Created:

2024-01-31

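Collected summary

Dataset introduction

Background and challenges

Background overview

CSR-II (WSJ1) Complete is an English speech recognition dataset. It contains a large amount of training and test speech, notably including spontaneous dictation samples, and is suited to speech recognition research. The waveforms are stored in compressed form, and the total duration is about 162 hours.

The above content was collected and summarized by 遇见数据集.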
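The 162-hour total can be reproduced from the figures in the resource description. This is a small arithmetic sketch, assuming the total counts the training and development test speech once for each of the two microphones. The variable names are illustrative only.

```python
# Figures quoted in the resource description.
TRAIN_HOURS = 73          # about 78,000 training utterances
DEV_TEST_HOURS = 8        # about 8,200 development test utterances
MICROPHONES = 2           # every utterance was recorded on two microphones
TRAIN_UTTERANCES = 78_000

# Speech per microphone, then doubled for the two-channel collection.
hours_per_microphone = TRAIN_HOURS + DEV_TEST_HOURS
total_hours = hours_per_microphone * MICROPHONES

# Rough average length of a training utterance, in seconds.
avg_train_utterance_s = round(TRAIN_HOURS * 3600 / TRAIN_UTTERANCES, 1)

print(total_hours)             # 162
print(avg_train_utterance_s)   # 3.4
```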
