2003 NIST Rich Transcription Evaluation Data

Name: 2003 NIST Rich Transcription Evaluation Data
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:19:29
License: No description available

DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2007S10


Description:

<h3>Introduction</h3><br>
<p>2003 NIST Rich Transcription Evaluation Data contains the test material used in the 2003 Rich Transcription Spring and Fall evaluations administered by the <a href="http://www.nist.gov/speech" rel="nofollow">NIST (National Institute of Standards and Technology) Speech Group</a>. The Spring evaluation (RT-03S), conducted in March-April 2003, focused on Speech-To-Text (STT) tasks for broadcast news (BN) speech and conversational telephone speech (CTS) in three languages: English, Mandarin Chinese and Arabic. That evaluation also included one Metadata Extraction (MDE) task: speaker diarization for English BN and CTS. The Fall evaluation (RT-03F), conducted in October 2003, focused on MDE tasks for English BN and CTS, including speaker diarization, speaker-attributed STT, SU (sentence/semantic unit) detection and disfluency detection. For complete information about the evaluations, see the <a href="https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation">Rich Transcription Evaluation website</a>.</p><br>
<h3>Data</h3><br>
<p>The BN datasets were selected from <a href="http://projects.ldc.upenn.edu/TDT4/" rel="nofollow">TDT-4</a> sources collected in February 2001. The evaluation excerpts were transcribed to the nearest story boundary. The English BN dataset is approximately three hours long and is composed of 30-minute excerpts from six different broadcasts. The Mandarin Chinese BN dataset is approximately one hour long, consisting of 12-minute excerpts from five different broadcasts. The Arabic BN dataset is also approximately one hour long and contains 30-minute excerpts from two different broadcasts.</p><br>
<p>The CTS datasets consist of material from various LDC telephone speech collections. All evaluation excerpts were transcribed to the nearest turn. The English CTS set is approximately six hours long and is composed of 5-minute excerpts from 72 different conversations: 36 from the <a href="http://catalog.ldc.upenn.edu/LDC2001S13" rel="nofollow">Switchboard Cellular</a> collection and 36 from the <a href="http://catalog.ldc.upenn.edu/LDC2004S13" rel="nofollow">Fisher collection</a>. The Mandarin Chinese CTS dataset is approximately one hour long and consists of 5-minute excerpts from 12 different conversations from the <a href="http://catalog.ldc.upenn.edu/LDC96S55" rel="nofollow">CallFriend Mandarin Chinese data</a>. The Arabic CTS set is also approximately one hour long and contains 5-minute excerpts from 12 different conversations from the <a href="http://catalog.ldc.upenn.edu/LDC97S45" rel="nofollow">CallHome Egyptian Arabic data</a>.</p><br>
<p>No manual (human-annotated) segmentations were provided. Sites were required to generate their own segmentations automatically.</p><br>
<p>Unlike the BN audio files, for which the full broadcasts were provided, the CTS audio files contain only the evaluation excerpts. Each audio excerpt is a SPHERE-headered, two-channel interleaved 8-bit mu-law file.</p><br>
<h3>Samples</h3><br>
<ul><br>
<li><a href="desc/addenda/LDC2007S10.wav" rel="nofollow">English Broadcast News Audio</a></li><br>
<li><a href="desc/addenda/LDC2007S10_ind.txt" rel="nofollow">Indices</a></li><br>
<li><a href="desc/addenda/LDC2007S10.txt" rel="nofollow">Transcriptions</a></li><br>
</ul><br>
<p>The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.</p><br>
Portions © 2001 American Broadcasting Company, © 2001 Cable News Network, LP, LLLP, © 2001 China Broadcasting System (Taiwan), © 2001 China Central TV, © 2001 China National Radio, © 2001 China Television System (Taiwan), © 2001 National Broadcasting Company, © 2001 Nile TV, © 2001 Public Radio International, © 1996-2005, 2007 Trustees of the University of Pennsylvania
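
Usage sketch (not part of the LDC release): the description above says each CTS excerpt is a SPHERE-headered, two-channel interleaved 8-bit mu-law file. The following Python sketch shows one way such a file image might be split into its header and payload and decoded to 16-bit linear samples per channel. It assumes the usual SPHERE layout (an ASCII header beginning with `NIST_1A`, followed by the header size in bytes) and an uncompressed payload; files stored with shorten compression would need to be decompressed first. The function names and the `channel_count` default are illustrative choices, not anything specified by the corpus documentation.

```python
def parse_sphere(data: bytes):
    """Split a SPHERE file image into (header fields, raw sample payload).

    Assumes the header starts with "NIST_1A\n", then the header size in
    bytes on the next line, then "name -type value" lines up to "end_head".
    """
    first = data[:16].decode("ascii").split("\n")
    if first[0] != "NIST_1A":
        raise ValueError("not a SPHERE file")
    header_size = int(first[1])  # int() ignores the space padding
    fields = {}
    text = data[16:header_size].decode("ascii", errors="replace")
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) == 3:
            name, ftype, value = parts
            # "-i" marks an integer field; other types are kept as strings
            fields[name] = int(value) if ftype == "-i" else value
    return fields, data[header_size:]


def mulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_channels(payload: bytes, channel_count: int = 2):
    """De-interleave the payload and decode each channel to linear PCM."""
    return [
        [mulaw_to_linear(b) for b in payload[c::channel_count]]
        for c in range(channel_count)
    ]
```

For a real file, the bytes read from disk would be passed to `parse_sphere`, and the resulting payload to `decode_channels`, using the header's `channel_count` value if that field is present. In a telephone conversation recording, each decoded channel would typically correspond to one side of the call.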

Provider:

Linguistic Data Consortium

Created:

2020-11-30
