ACE 2007 Multilingual Training Corpus

Name: ACE 2007 Multilingual Training Corpus
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:26:39
License: No description available

DataCite Commons: updated 2021-07-01, indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2014T18

Resource description:

<h3>Introduction</h3> ACE 2007 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains the complete set of Arabic and Spanish training data for the <a href="http://www.itl.nist.gov/iad/mig//tests/ace/2007/">2007 Automatic Content Extraction</a> (ACE) technology evaluation, specifically, Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC <a href="https://www.ldc.upenn.edu/collaborations/past-projects/ace">ACE project</a> pages. The LDC Catalog contains a series of publications from the ACE project and from researchers building on that work. 
Among them are: <ul> <li>ACE-2 Version 1.0 (<a href="../../../LDC2003T11">LDC2003T11</a>)</li> <li>TIDES Extraction (ACE) 2003 Multilingual Training Data (<a href="../../../LDC2004T09">LDC2004T09</a>)</li> <li>ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (<a href="../../../LDC2005T07">LDC2005T07</a>)</li> <li>ACE 2004 Multilingual Training Corpus (<a href="../../../LDC2005T09">LDC2005T09</a>)</li> <li>ACE 2005 Multilingual Training Corpus (<a href="../../../LDC2006T06">LDC2006T06</a>)</li> <li>ACE 2005 English SpatialML Annotations (<a href="../../../LDC2008T03">LDC2008T03</a>)</li> <li>ACE 2005 Mandarin SpatialML Annotations (<a href="../../../LDC2010T09">LDC2010T09</a>)</li> <li>ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 (<a href="../../../LDC2010T18">LDC2010T18</a>)</li> <li>ACE 2005 English SpatialML Annotations Version 2 (<a href="../../../LDC2011T02">LDC2011T02</a>)</li> <li>Datasets for Generic Relation Extraction (reACE) (<a href="../../../LDC2011T08">LDC2011T08</a>)</li> </ul> <h3>Data</h3> The Arabic data is composed of newswire (60%) published in October 2000-December 2000 and weblogs (40%) published during the period November 2004-February 2005. The Spanish data set consists entirely of newswire material from multiple sources published in January 2005-April 2005. Data selection was semi-automatic. A document pool was established for each language based on genre and epoch requirements. Humans reviewed the pool to select individual documents suitable for ACE annotation, such as documents that were representative of their genre and contained targeted ACE entity types. One annotator completed the entity and temporal expression (TIMEX2) markup in the first pass annotation. This work was reviewed in the second pass by a senior annotator. TIMEX2 values were normalized by an annotator specifically trained for that task. The table below describes the amount of data included in the current release and its annotation status. 
Corpus content for each language and data type (NW = newswire, WL = weblogs) is reported for the three stages of annotation: first pass annotation (1P), second pass annotation (2P), and TIMEX2 normalization with additional quality control (NORM). <table border="1"> <tbody> <tr><th colspan="7">Arabic</th></tr> <tr> <td> </td> <td colspan="3">Words</td> <td colspan="3">Files</td> </tr> <tr> <td> </td> <td>1P</td> <td>2P</td> <td>NORM</td> <td>1P</td> <td>2P</td> <td>NORM</td> </tr> <tr> <td>NW</td> <td>58,015</td> <td>58,015</td> <td>58,015</td> <td>257</td> <td>257</td> <td>257</td> </tr> <tr> <td>WL</td> <td>40,338</td> <td>40,338</td> <td>40,338</td> <td>121</td> <td>121</td> <td>121</td> </tr> <tr> <td>Total</td> <td>98,353</td> <td>98,353</td> <td>98,353</td> <td>378</td> <td>378</td> <td>378</td> </tr> <tr><th colspan="7">Spanish</th></tr> <tr> <td> </td> <td colspan="3">Words</td> <td colspan="3">Files</td> </tr> <tr> <td> </td> <td>1P</td> <td>2P</td> <td>NORM</td> <td>1P</td> <td>2P</td> <td>NORM</td> </tr> <tr> <td>NW</td> <td>100,401</td> <td>100,401</td> <td>100,401</td> <td>352</td> <td>352</td> <td>352</td> </tr> <tr> <td>Total</td> <td>100,401</td> <td>100,401</td> <td>100,401</td> <td>352</td> <td>352</td> <td>352</td> </tr> </tbody> </table> For a given document, each of the three directories "1p", "2p" and "timex2norm" contains a source .sgm file together with its .ag.xml and .apf.xml annotation files. In other words, for each newswire story or weblog entry, the three annotation directories each contain an identical copy of the source text (SGML .sgm file) along with distinct versions of the associated annotations (XML .ag.xml and .apf.xml files and plain-text .tab files). Note that in many cases two annotation stages produced identical output for a given source text, because no changes were made in the later stage.
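The directory layout described above can be explored with a short script. The following is a minimal sketch rather than an official LDC tool. It assumes that `root` is whatever directory directly contains the "1p", "2p" and "timex2norm" directories for one language and genre, and that all files belonging to a document share the same base name (for example a hypothetical `doc1.sgm`, `doc1.ag.xml`, `doc1.apf.xml`, `doc1.tab`). Neither assumption is confirmed by the description. The function names are illustrative only.

```python
import filecmp
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Stage directory names and file suffixes as quoted in the corpus description.
STAGES = ("1p", "2p", "timex2norm")
SUFFIXES = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def split_name(path: Path) -> Optional[Tuple[str, str]]:
    """Return (document id, suffix) for a recognised corpus file, else None.

    Assumption: the document id is the file name minus one of SUFFIXES.
    """
    for suffix in SUFFIXES:
        if path.name.endswith(suffix):
            return path.name[: -len(suffix)], suffix
    return None


def index_stage(stage_dir: Path) -> Dict[str, Dict[str, Path]]:
    """Map each document id to {suffix: path} for one stage directory."""
    index: Dict[str, Dict[str, Path]] = {}
    for path in sorted(stage_dir.iterdir()):
        parsed = split_name(path)
        if parsed is not None:
            doc, suffix = parsed
            index.setdefault(doc, {})[suffix] = path
    return index


def unchanged_between(
    root: Path, earlier: str, later: str, suffix: str = ".apf.xml"
) -> List[str]:
    """List document ids whose `suffix` file is byte-identical in both stages."""
    first = index_stage(root / earlier)
    second = index_stage(root / later)
    same = []
    for doc in sorted(first.keys() & second.keys()):
        a, b = first[doc].get(suffix), second[doc].get(suffix)
        # shallow=False forces a content comparison instead of a stat check.
        if a is not None and b is not None and filecmp.cmp(a, b, shallow=False):
            same.append(doc)
    return same
```

Calling `unchanged_between(root, "1p", "2p")` would then list the documents whose APF annotation was left untouched by the second pass. A byte-level comparison is the strictest possible test: if the annotation files embed stage-specific metadata, documents with identical annotations could still be reported as different, so a comparison of the parsed XML may be more appropriate in practice.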
All files are presented in UTF-8. <h3>Samples</h3> Please view the following samples: <ul> <li><a href="desc/addenda/LDC2014T18.sgm.jpg">SGML Sample</a></li> <li><a href="desc/addenda/LDC2014T18.ag.jpg">AG XML Sample</a></li> <li><a href="desc/addenda/LDC2014T18.apf.jpg">APF XML Sample</a></li> <li><a href="desc/addenda/LDC2014T18.tab.txt">Tab Delimited Sample</a></li> </ul> <h3>Updates</h3> None at this time. Portions © 2000, 2005 Agence France Presse, © 2000 Al Hayat, © 2000 An Nahar, © 2005 The Associated Press, © 2005 Xinhua News Agency, © 2005-2007, 2014 Trustees of the University of Pennsylvania

Provider:

Linguistic Data Consortium

Created:

2020-11-30
