2003 NIST Rich Transcription Evaluation Data
Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2007S10
Description:
<h3>Introduction</h3><br>
<p>2003 NIST Rich Transcription Evaluation Data contains the test material used in the 2003 Rich Transcription Spring and Fall evaluations administered by the <a href="http://www.nist.gov/speech" rel="nofollow">NIST (National Institute of Standards and Technology) Speech Group</a>. The Spring evaluation (RT-03S), conducted in March-April 2003, focused on Speech-To-Text (STT) tasks for broadcast news speech and conversational telephone speech in three languages: English, Mandarin Chinese and Arabic. That evaluation also included one Metadata Extraction (MDE) task: speaker diarization for broadcast news speech and conversational telephone speech in English. The Fall evaluation (RT-03F), conducted in October 2003, focused on MDE tasks including speaker diarization, speaker-attributed STT, SU (sentence/semantic unit) detection and disfluency detection for broadcast news speech and conversational telephone speech in English. For complete information about the evaluations, see the <a href="https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation">Rich Transcription Evaluation website</a>.</p><br>
<h3>Data</h3><br>
<p>The broadcast news (BN) datasets were selected from <a href="http://projects.ldc.upenn.edu/TDT4/" rel="nofollow">TDT-4</a> sources collected in February 2001. The evaluation excerpts were transcribed to the nearest story boundary. The English BN dataset is approximately three hours long and is composed of 30-minute excerpts from six different broadcasts. The Mandarin Chinese BN dataset is approximately one hour long and consists of 12-minute excerpts from five different broadcasts. The Arabic BN dataset is also approximately one hour long and contains 30-minute excerpts from two different broadcasts.</p><br>
<p>The conversational telephone speech (CTS) datasets consist of material from various LDC telephone speech collections. All evaluation excerpts were transcribed to the nearest turn. The English CTS set is approximately six hours long and is composed of 5-minute excerpts from 72 different conversations: 36 from the <a href="http://catalog.ldc.upenn.edu/LDC2001S13" rel="nofollow">Switchboard Cellular</a> collection and 36 from the <a href="http://catalog.ldc.upenn.edu/LDC2004S13" rel="nofollow">Fisher collection</a>. The Mandarin Chinese CTS dataset is approximately one hour long and consists of 5-minute excerpts from 12 different conversations from the <a href="http://catalog.ldc.upenn.edu/LDC96S55" rel="nofollow">CallFriend Mandarin Chinese data</a>. The Arabic CTS set is also approximately one hour long and contains 5-minute excerpts from 12 different conversations from the <a href="http://catalog.ldc.upenn.edu/LDC97S45" rel="nofollow">CallHome Egyptian Arabic data</a>.</p><br>
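<p>The stated totals follow directly from the excerpt lengths and counts given above. The short sketch below reproduces that arithmetic; the dictionary keys and the function name are illustrative labels, not identifiers from the corpus, and the figures are nominal, since the description gives all durations as approximate.</p><br>

```python
# Nominal excerpt plan taken from the description:
# dataset label -> (excerpt length in minutes, number of excerpts).
# The labels are illustrative, not corpus identifiers.
EXCERPT_PLAN = {
    "English BN": (30, 6),
    "Mandarin BN": (12, 5),
    "Arabic BN": (30, 2),
    "English CTS": (5, 72),
    "Mandarin CTS": (5, 12),
    "Arabic CTS": (5, 12),
}


def nominal_hours(plan):
    """Return the nominal duration in hours for each dataset."""
    return {name: minutes * count / 60 for name, (minutes, count) in plan.items()}


TOTALS = nominal_hours(EXCERPT_PLAN)

if __name__ == "__main__":
    for name, hours in TOTALS.items():
        print(f"{name}: {hours:g} h")
```

<p>Under these nominal figures the six datasets sum to 13 hours of evaluation audio; actual durations differ slightly because excerpts were cut at story or turn boundaries.</p><br>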
<p>No manual (human-annotated) segmentations were provided. Sites were required to generate their own segmentations automatically.</p><br>
<p>Unlike the BN audio files, for which the full broadcasts were provided, the CTS audio files contain only the evaluation excerpts. Each audio excerpt is a SPHERE-headered, two-channel interleaved 8-bit mu-law file.</p><br>
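<p>As a rough illustration of what "SPHERE-headered, two-channel interleaved 8-bit mu-law" means in practice, the sketch below parses a NIST SPHERE text header and splits the payload into two channels of 16-bit linear samples using the standard G.711 mu-law expansion. It assumes the sample data is stored uncompressed; files packed with the <code>shorten</code> codec would first need to be unpacked with a SPHERE tool. The function names are hypothetical, and this is not tooling shipped with the corpus.</p><br>

```python
def read_sphere_header(raw: bytes):
    """Parse a NIST SPHERE header; return (header_size, fields)."""
    if raw[:7] != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second 8-byte line holds the total header size (commonly 1024).
    header_size = int(raw[8:16].decode("ascii").strip())
    fields = {}
    text = raw[16:header_size].decode("ascii", errors="replace")
    for line in text.splitlines():
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag (-i, -r, -sN), value
        if len(parts) == 3:
            name, type_flag, value = parts
            fields[name] = int(value) if type_flag == "-i" else value
    return header_size, fields


def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_two_channel_ulaw(raw: bytes):
    """Return (channel_a, channel_b) as lists of linear samples."""
    header_size, fields = read_sphere_header(raw)
    if "shorten" in str(fields.get("sample_coding", "")):
        raise ValueError("compressed SPHERE data; unpack it first")
    if fields.get("channel_count") != 2:
        raise ValueError("expected a two-channel file")
    payload = raw[header_size:]
    # Interleaved layout: bytes alternate channel A, channel B, A, B, ...
    channel_a = [ulaw_to_linear(b) for b in payload[0::2]]
    channel_b = [ulaw_to_linear(b) for b in payload[1::2]]
    return channel_a, channel_b
```

<p>In a CTS recording each channel typically carries one side of the conversation, so de-interleaving in this way separates the two speakers' audio before any further processing.</p><br>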
<h3>Samples</h3><br>
<ul><br>
<li><a href="desc/addenda/LDC2007S10.wav" rel="nofollow">English Broadcast News Audio</a></li><br>
<li><a href="desc/addenda/LDC2007S10_ind.txt" rel="nofollow">Indices</a></li><br>
<li><a href="desc/addenda/LDC2007S10.txt" rel="nofollow">Transcriptions</a></li><br>
</ul><br>
<p>The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.</p><br>
Portions © 2001 American Broadcasting Company, © 2001 Cable News Network, LP, LLLP, © 2001 China Broadcasting System (Taiwan), © 2001 China Central TV, © 2001 China National Radio, © 2001 China Television System (Taiwan), © 2001 National Broadcasting Company, © 2001 Nile TV, © 2001 Public Radio International, © 1996-2005, 2007 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



