RT-03 MDE Training Data Text and Annotations

Name: RT-03 MDE Training Data Text and Annotations
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:16:52
License: No description available

DataCite Commons: updated 2021-07-01, indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2004T12

Description:

<h3>Introduction</h3>
<p>The MDE RT-03 Training Data Text and Annotations corpus was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2004T12 and its ISBN is 1-58563-301-1.</p>
<p>This data was originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes.</p>
<p>The data in this release consists of English Conversational Telephone Speech (CTS) and Broadcast News (BN) transcripts and annotations. The corresponding speech data is available as <a href="http://catalog.ldc.upenn.edu/LDC2004S08" rel="nofollow">MDE RT-03 Training Data Speech</a>.</p>
<h3>Data</h3>
<p>The release contains 633 files totaling approximately 747 MB and 764,978 tokens. The transcripts and annotations cover approximately 20 hours of Broadcast News and over 40 hours of Conversational Telephone Speech data. The annotated data was distributed as training data for the RT-03F evaluation cycle.</p>
<p>The CTS data was drawn from the <a href="http://catalog.ldc.upenn.edu/LDC97S62" rel="nofollow">Switchboard-1 Release 2</a> corpus.</p>
<p>The BN speech data was drawn from the <a href="http://catalog.ldc.upenn.edu/LDC98S71" rel="nofollow">1997 English Broadcast News Speech (HUB4)</a> corpus, from four distinct sources:</p>
<table>
<tr> <td colspan="60%">American Broadcasting Company</td> <td colspan="15%">(ABC)</td> <td colspan="15%">(1998, 2001)</td> </tr>
<tr> <td colspan="60%">National Broadcasting Company</td> <td colspan="15%">(NBC)</td> <td colspan="15%">(1998, 2001)</td> </tr>
<tr> <td colspan="60%">Public Radio International</td> <td colspan="15%">(PRI)</td> <td colspan="15%">(1998)</td> </tr>
<tr> <td colspan="60%">Cable News Network</td> <td colspan="15%">(CNN)</td> <td colspan="15%">(2001)</td> </tr>
</table>
<h3>Annotations</h3>
<p>The transcripts within this corpus have been annotated for various kinds of metadata. In simple terms, the goal of MDE is the creation of automatic transcripts that are maximally readable. To this end, LDC has defined a SimpleMDE annotation task. Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh" and "um," discourse markers like "you know," asides and parentheticals, and editing terms like "sorry" and "I mean." Edit disfluencies are also identified: the full extent of the disfluency (or string of adjacent disfluencies) and its interruption points are tagged. Annotators further identify SUs (variously called semantic units, sense units, syntactic units, slash units or sentence units), that is, units within the discourse that express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, here by presenting information in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs). The <a href="./docs" rel="nofollow">docs</a> directory contains the complete set of SimpleMDE annotation guidelines used to create this data.</p>
<h3>Data Format</h3>
<p>The data appears in two formats. The <a href="http://agtk.sourceforge.net/doc/aglib/2.0/formats.html#ATLAS" rel="nofollow">AG Atlas (ag.xml) format</a> is the native annotation format and uses the <a href="http://agtk.sf.net/" rel="nofollow">Annotation Graph Library</a>. This data is best explored using the LDC MDE Toolkit, which is freely available at <a href="http://www.ldc.upenn.edu/Projects/MDE/Tools" rel="nofollow">http://www.ldc.upenn.edu/Projects/MDE/Tools</a>.</p>
<p>The data is also provided in the <a href="http://www.nist.gov/speech/tests/rt/rt2003/fall/index.htm" rel="nofollow">RTTM format</a> developed by NIST to support the EARS Program. The RTTM format labels each token in the reference transcript according to the properties it displays: lexeme vs. non-lexeme; edit, filler, SU, etc.</p>
<p>An <a href="http://www.ldc.upenn.edu/Catalog/desc/addenda/LDC2004T12.rttm" rel="nofollow">RTTM file example</a> is available.</p>
<p>General information about the EARS MDE annotation effort, including free annotation tools and annotation guidelines, can be found at LDC's <a href="http://www.ldc.upenn.edu/Projects/MDE/" rel="nofollow">EARS MDE Project Page</a>.</p>
<h3>Updates</h3>
<p>There are no updates available at this time.
</p> Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania </br> The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
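
The RTTM description above says that each record labels a token or metadata object by type (lexeme, filler, edit, SU and so on). A minimal Python sketch of reading such a file might look like the following. It assumes the nine whitespace-separated fields commonly associated with RT-03 RTTM files (type, file, channel, begin time, duration, orthography, subtype, speaker name, confidence), with `<NA>` marking an empty field and `;;` starting a comment line. The field layout, the sample records and the helper names (`RttmRecord`, `parse_rttm`) are assumptions for illustration and are not taken from this release; check them against the RTTM specification linked above before relying on them.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class RttmRecord(NamedTuple):
    """One RTTM line, under the assumed nine-field layout."""
    rtype: str                # e.g. LEXEME, FILLER, EDIT, SU (assumed type names)
    file: str
    channel: str
    tbeg: Optional[float]     # begin time in seconds
    tdur: Optional[float]     # duration in seconds
    ortho: Optional[str]      # orthography of the token, if any
    subtype: Optional[str]
    speaker: Optional[str]
    conf: Optional[str]


def _na(value: str) -> Optional[str]:
    """Map the assumed '<NA>' placeholder to None."""
    return None if value == "<NA>" else value


def _num(value: str) -> Optional[float]:
    """Convert a time field to float, keeping None for '<NA>'."""
    cleaned = _na(value)
    return None if cleaned is None else float(cleaned)


def parse_rttm(text: str) -> List[RttmRecord]:
    """Parse RTTM text, skipping blank lines and ';;' comment lines."""
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) != 9:
            raise ValueError(f"line {number}: expected 9 fields, got {len(fields)}")
        rtype, file, channel, tbeg, tdur, ortho, subtype, speaker, conf = fields
        records.append(RttmRecord(rtype, file, channel, _num(tbeg), _num(tdur),
                                  _na(ortho), _na(subtype), _na(speaker), _na(conf)))
    return records


# Hypothetical sample records, invented for illustration only.
SAMPLE = """\
;; hypothetical sample, not taken from the corpus
SPEAKER sw_demo 1 0.00 2.10 <NA> <NA> spkr_A <NA>
SU sw_demo 1 0.00 2.10 <NA> statement spkr_A <NA>
FILLER sw_demo 1 0.00 0.30 <NA> filled_pause spkr_A <NA>
LEXEME sw_demo 1 0.00 0.30 uh fp spkr_A <NA>
LEXEME sw_demo 1 0.35 0.25 i lex spkr_A <NA>
LEXEME sw_demo 1 0.60 0.40 agree lex spkr_A <NA>
"""

if __name__ == "__main__":
    parsed = parse_rttm(SAMPLE)
    # Tally how many records of each type the file contains.
    print(Counter(record.rtype for record in parsed))
    # Collect the orthography of every LEXEME record in file order.
    print([record.ortho for record in parsed if record.rtype == "LEXEME"])
```

Tallying record types is a quick sanity check on a file before doing anything more elaborate, such as aligning FILLER or EDIT spans with the LEXEME records they cover by comparing begin times and durations.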

Provider: Linguistic Data Consortium
Created: 2020-11-30