
Voicemail Corpus Part II

DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2002S35
Resource description:
<h3>Introduction</h3><br>
<p>Voicemail Corpus Part II was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2002S35 and ISBN 1-58563-242-2. It is a continuation of Voicemail Corpus Part I, <a href="../../../LDC98S77" rel="nofollow">LDC98S77</a>.</p><br>
<h3>Data</h3><br>
<p>This publication comprises speech and script files, divided into training and evaluation data. The training data consists of 2,048 voicemail messages and the corresponding script files. The training speech and script files are organized in 41 directories, each of which contains up to 50 messages. The evaluation data consists of 50 voicemail messages and 50 scripts.</p><br>
<p>The speech data is provided in SPHERE format, sampled at 8 kHz and recorded in 8-bit u-law, totalling approximately 14 hours (406 MB) for training and 23 minutes (11 MB) for evaluation.</p><br>
<p>In addition to the individual script files, there are three files that concatenate the individual scripts. train_scripts.all and eval_scripts.all concatenate the training and evaluation script files respectively, one file per line, each line beginning with the fileID. eval_scripts_filtered.all is a filtered version of eval_scripts.all, with the tagged elements () and the proper nouns marker removed.</p><br>
<h3>Updates</h3><br>
<p>A more recent version of the paper Automatic Speech Recognition Performance on a Voicemail Transcription Task (M. Padmanabhan, G. Saon, J. Huang, B. Kingsbury and L. Mangu, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, pp. 433-442, October 2002) is available in both PDF and PS format by email request.</p><br>
<h3>Samples</h3><br>
<p>Please view the following samples:</p><br>
<ul><br> <li><a href="desc/addenda/LDC2002S35.1.sph">Audio Sample 1 (SPH)</a></li><br> <li><a href="desc/addenda/LDC2002S35.1.txt">Transcript Sample 1 (TXT)</a></li><br> <li><a href="desc/addenda/LDC2002S35.2.sph">Audio Sample 2 (SPH)</a></li><br> <li><a href="desc/addenda/LDC2002S35.2.txt">Transcript Sample 2 (TXT)</a></li><br> </ul><br>
<h3>Additional Licensing Instructions</h3><br>
<p>This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact&nbsp;<a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>&nbsp;for information about becoming a member.</p><br>
Portions © 2002 International Business Machines Corporation, © 2002 Trustees of the University of Pennsylvania
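The description says that each concatenated script file (for example train_scripts.all) holds one script per line, with every line beginning with its fileID. A minimal Python sketch of reading such a file into a lookup table might look like the following. It assumes, which the description does not state, that the fileID is the first whitespace-delimited token on the line; the function name and the sample lines are hypothetical and not taken from the corpus.

```python
from typing import Dict, Iterable


def parse_concatenated_scripts(lines: Iterable[str]) -> Dict[str, str]:
    """Map each fileID to the transcript text that follows it.

    Assumption (not confirmed by the corpus description): the fileID is
    the first whitespace-delimited token on each line.
    """
    scripts: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split once on any run of whitespace
        file_id = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        scripts[file_id] = text
    return scripts


# Hypothetical sample lines; real fileIDs and transcripts will differ.
sample = [
    "vm0001 hi this is a made-up message\n",
    "vm0002 please call me back\n",
]
scripts = parse_concatenated_scripts(sample)
print(len(scripts), "scripts parsed")
```

For a real file, the same function can be given an open file handle, e.g. `with open("train_scripts.all") as f: scripts = parse_concatenated_scripts(f)`, adjusting the text encoding if needed.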
Provider:
Linguistic Data Consortium
Created:
2020-11-30