ACE 2004 Multilingual Training Corpus

Name: ACE 2004 Multilingual Training Corpus
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:17:45
License: No description available

Source: DataCite Commons. Updated: 2021-07-01. Indexed: 2025-04-16.

Download link:

https://catalog.ldc.upenn.edu/LDC2005T09

Description:

<h3>Introduction</h3>

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations. It was created by the Linguistic Data Consortium with support from the ACE Program and additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This data was previously distributed as an e-corpus (LDC2004E17) to participants in the 2004 ACE evaluation.

The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

The current publication consists of the official training data for these evaluation tasks. A seventh evaluation area, Timex Detection and Recognition, is supported by the ACE Time Normalization (TERN) 2004 English Training Data Corpus (<a href="../../../LDC2005T07">LDC2005T07</a>). The TERN corpus source data largely overlaps with the English source data contained in the current release.

For more information about linguistic resources for the ACE program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's <a href="https://www.ldc.upenn.edu/collaborations/past-projects/ace">ACE website</a>.

<h3>Samples</h3>

The files listed below are samples from the English data. They should provide a good example of the material in this corpus.

<ul>
<li><a href="desc/addenda/LDC2005T09_chtb_apf.xml" rel="nofollow">Chinese Treebank</a></li>
<li><a href="desc/addenda/LDC2005T09_fsh_apf.xml" rel="nofollow">Fisher Transcripts</a></li>
<li><a href="desc/addenda/LDC2005T09_bnews_apf.xml" rel="nofollow">Broadcast News</a></li>
</ul>

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Portions (c) 1994-1998, 2000 Xinhua News Agency; (c) 1997 Department of Information Services, Hong Kong Special Administrative Region; (c) 1996-1998, 2000-2001 Sinorama Magazine; (c) 2000 Agence France-Presse; (c) 2000 New York Times; (c) 2000 Associated Press; (c) 2000 SPH AsiaOne, Ltd. (Zaobao); (c) 2000 An-Nahar; (c) 2000 Al-Hayat; (c) 2000 Nile TV; (c) 2000 Cable News Network, All Rights Reserved; (c) 2000 American Broadcasting Corporation; (c) 2000 National Broadcasting Company, Inc.; (c) 2000 China National Radio; (c) 2000 China Television System; (c) 2000 China Central TV; (c) 2000 China Broadcasting System; (c) 2000 Public Radio International; (c) 2005 Trustees of the University of Pennsylvania.


Provider:

Linguistic Data Consortium

Created:

2020-11-30
