ACE 2005 Multilingual Training Corpus

Name: ACE 2005 Multilingual Training Corpus
Creator: Linguistic Data Consortium
Published: 2024-11-16 08:55:09
License: No description available

DataCite Commons | Updated 2024-11-16 | Indexed 2024-07-13

Download link:

https://catalog.ldc.upenn.edu/LDC2006T06

Resource description:

<h3>Introduction</h3>
ACE 2005 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains approximately 1,800 files of mixed-genre text in English, Arabic, and Chinese annotated for entities, relations, and events. This represents the complete set of training data in those languages for the 2005 Automatic Content Extraction (ACE) technology evaluation. The genres include newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech. The data was annotated by LDC with support from the ACE Program and additional assistance from LDC.
The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form.
In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation, and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese, and Arabic. Event tasks were evaluated in English and Chinese only. This release comprises the official training data for these evaluation tasks.
For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions, and other documentation, see LDC's <a href="https://www.ldc.upenn.edu/collaborations/past-projects/ace" rel="nofollow">ACE website</a>.
<h3>Data</h3>
Below is information about the amount of data in this release and its annotation status. Further information, such as the breakdown of genres and formats, can be found in the associated README file.
<ul> <ul>
<li>1P: data subject to first pass (complete) annotation</li>
<li>DUAL: data also subject to dual first pass (complete) annotation</li>
<li>ADJ: data also subject to discrepancy resolution/adjudication</li>
<li>NORM: data also subject to TIMEX2 normalization</li>
</ul> </ul>
<table border="1" width="50%"> <tbody>
<tr> <td colspan="8">English</td> </tr>
<tr> <td colspan="4">words</td> <td colspan="4">files</td> </tr>
<tr> <td>1P</td> <td>DUAL</td> <td>ADJ</td> <td>NORM</td> <td>1P</td> <td>DUAL</td> <td>ADJ</td> <td>NORM</td> </tr>
<tr> <td>303833</td> <td>297185</td> <td>216545</td> <td>259889</td> <td>666</td> <td>650</td> <td>535</td> <td>599</td> </tr>
</tbody> </table>
<table border="1" width="35%"> <tbody>
<tr> <td colspan="6">Chinese. Note: Chinese data are expressed in characters. We assume a correspondence of roughly 1.5 characters/word.</td> </tr>
<tr> <td colspan="3">chars</td> <td colspan="3">files</td> </tr>
<tr> <td>1P</td> <td>DUAL</td> <td>ADJ</td> <td>1P</td> <td>DUAL</td> <td>ADJ</td> </tr>
<tr> <td>334121</td> <td>325834</td> <td>307991</td> <td>687</td> <td>671</td> <td>633</td> </tr>
</tbody> </table>
<table border="1" width="35%"> <tbody>
<tr> <td colspan="6">Arabic</td> </tr>
<tr> <td colspan="3">words</td> <td colspan="3">files</td> </tr>
<tr> <td>1P</td> <td>DUAL</td> <td>ADJ</td> <td>1P</td> <td>DUAL</td> <td>ADJ</td> </tr>
<tr> <td>112233</td> <td>103504</td> <td>100114</td> <td>433</td> <td>409</td> <td>403</td> </tr>
</tbody> </table>
<h3>Samples</h3>
For examples of the data in this publication, please review the following samples:
<ul>
<li><a href="desc/addenda/LDC2006T06.ara.xml">Arabic (XML)</a></li>
<li><a href="desc/addenda/LDC2006T06.eng.xml">English (XML)</a></li>
<li><a href="desc/addenda/LDC2006T06.cmn.xml">Chinese (XML)</a></li>
</ul>
<h3>Updates</h3>
None at this time.
Portions © 2000-2003 Agence France Presse, © 2003 The Associated Press, © 2003 New York Times, © 2000-2001, 2003 Xinhua News Agency, © 2003 Cable News Network LP, LLLP, © 2000-2001 SPH AsiaOne Ltd, © 2000-2001 China Broadcasting System, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 2000-2001 China Central TV, © 2000-2001 Al Hayat, © 2000-2001 An-Nahar, © 2000-2001 Nile TV, © 2005, 2006 Trustees of the University of Pennsylvania
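
As a quick consistency check on the figures in the Data tables, the sketch below sums the first-pass (1P) file counts and converts the Chinese character count to an approximate word count using the stated ratio of roughly 1.5 characters per word. The dictionary layout and the helper names (`FIRST_PASS`, `approx_words`, `summarize`) are illustrative assumptions, not part of the LDC release; only the numbers are copied from the tables.

```python
# 1P (first-pass) counts copied from the Data tables; the layout is hypothetical.
FIRST_PASS = {
    "English": {"units": 303833, "unit_kind": "words", "files": 666},
    "Chinese": {"units": 334121, "unit_kind": "chars", "files": 687},
    "Arabic": {"units": 112233, "unit_kind": "words", "files": 433},
}

# Ratio stated in the note on the Chinese table (an approximation).
CHARS_PER_WORD = 1.5


def approx_words(units: int, unit_kind: str) -> int:
    """Return a word count, converting characters with the 1.5 chars/word ratio."""
    if unit_kind == "chars":
        return round(units / CHARS_PER_WORD)
    return units


def summarize(table: dict) -> tuple[int, dict]:
    """Return the total file count and the approximate word count per language."""
    total_files = sum(row["files"] for row in table.values())
    words = {
        lang: approx_words(row["units"], row["unit_kind"])
        for lang, row in table.items()
    }
    return total_files, words


total_files, words = summarize(FIRST_PASS)
print("1P files:", total_files)
print("approximate 1P words:", words)
print("approximate 1P words, all languages:", sum(words.values()))
```

The first-pass file total comes to 1,786, which matches the "approximately 1,800 files" quoted in the introduction. The Chinese word figure is only as good as the 1.5 characters/word assumption.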

Provider:

Linguistic Data Consortium

Created:

2020-11-30
