Machine Reading Phase 1 IC Training Data

Name: Machine Reading Phase 1 IC Training Data
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:34:05
License: No description available

DataCite Commons; updated 2021-07-01; indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2020T04

Description:

<h3>Introduction</h3><br>
<p>Machine Reading Phase 1 IC Training Data was developed by the Linguistic Data Consortium and contains 248 English source documents and 116 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program.</p><br>
<p>The Machine Reading (MR) program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.</p><br>
<p>The data in this release constitutes the training data for the IC (Core Domain) task. The IC Use Cases tested the core domain by extracting information about Entities (people, organizations, and geopolitical entities or "GPEs") and their involvement in four types of Relations, as described in newswire text: Attack Relations (e.g. bombings), Biographical Relations (e.g. being a citizen of a country), Affiliation Relations (e.g. being a leader of an organization), and Family Relations (e.g. having a spouse). This information was then aligned with an IC Use Cases ontology that would allow automated reasoning about the extracted Entities and Relations.</p><br>
<h3>Data</h3><br>
<p>This release contains 248 source documents (108,960 words) from English newswire stories in English Gigaword Fourth Edition (<a href="../../../LDC2009T13">LDC2009T13</a>). Roughly half of those documents (116) were annotated for IC/Core Use Cases. Annotation was non-exhaustive, but an attempt was made to provide instances of all relations and their arguments where explicitly stated in a single sentence, as well as some non-explicit relations, which were marked with an "Inferred" tag by the annotator.</p><br>
<p>Annotations are in GUI XML (traditional annotation) and RDF XML (formal knowledge representation) formats. A second set of GUI XML is provided with additional, unofficial annotations. All source and annotation files are presented as UTF-8 encoded XML files with associated DTDs, schemas or ontologies.</p><br>
<h3>Acknowledgments</h3><br>
<p>The Linguistic Data Consortium gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09 C-xxxx. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, AFRL, or the US government.</p><br>
<h3>Samples</h3><br>
<p>Please view the following samples:</p><br>
<ul><br>
<li><a href="desc/addenda/LDC2020T04.src.xml">Source</a></li><br>
<li><a href="desc/addenda/LDC2020T04.rdf.xml">RDF XML</a></li><br>
<li><a href="desc/addenda/LDC2020T04.gui.xml">GUI XML</a></li><br>
<li><a href="desc/addenda/LDC2020T04.gui_x.xml">GUI XML Extended</a></li><br>
</ul><br>
<h3>Updates</h3><br>
<p>None at this time.</p><br>
Portions © 1994-1997, 2001-2006 Agence France Presse, © 2002 An Nahar, © 1995-1998, 2000-2001, 2005-2006 The Associated Press, © 1996-1998, 2004, 2006 Los Angeles Times-Washington Post News Service, Inc., © 1994-2002, 2004-2006 New York Times, © 1994 Reuters America, Inc., © 1995-2006 Xinhua News Agency, © 2009, 2020 Trustees of the University of Pennsylvania
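
Because the description states that all source and annotation files are UTF-8 encoded XML, one quick sanity check after download is to confirm that every file parses and to see which root elements occur. The following is only a minimal sketch and is not part of the LDC release: the function name `summarize_xml_files` and its directory argument are hypothetical, and nothing is assumed about the corpus's actual element names or directory layout.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def summarize_xml_files(root_dir):
    """Parse every .xml file under root_dir and tally root element tags.

    Hypothetical helper for illustration. Returns (tag_counts, failures):
    a Counter keyed by root tag, and a list of (path, error message)
    pairs for files that are not well-formed XML.
    """
    tag_counts = Counter()
    failures = []
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            # ET.parse reads the file as bytes and honours the encoding in
            # the XML declaration, defaulting to UTF-8 when none is given.
            tree = ET.parse(str(path))
        except ET.ParseError as exc:
            failures.append((str(path), str(exc)))
            continue
        tag_counts[tree.getroot().tag] += 1
    return tag_counts, failures
```

This checks well-formedness only. It does not validate files against the accompanying DTDs or schemas, and the resulting tally can be compared with the document counts stated above only once you know which root tag each file type uses.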


Provider:

Linguistic Data Consortium

Created:

2020-11-30
