Annotated English Gigaword
Source: DataCite Commons. Updated: 2021-07-01. Indexed: 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2012T21
Resource description:
<h3>Introduction</h3><br>
<p>Annotated English Gigaword was developed by <a href="http://hltcoe.jhu.edu/" rel="nofollow">Johns Hopkins University's Human Language Technology Center of Excellence</a>. It adds automatically generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (<a href="http://catalog.ldc.upenn.edu/LDC2011T07" rel="nofollow">LDC2011T07</a>) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics, enabling broader researcher involvement in large-scale knowledge-acquisition efforts.</p><br>
<h3>Data</h3><br>
<p>Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources:</p><br>
<ul><br>
<li>Agence France-Presse, English Service (afp_eng)</li><br>
<li>Associated Press Worldstream, English Service (apw_eng)</li><br>
<li>Central News Agency of Taiwan, English Service (cna_eng)</li><br>
<li>Los Angeles Times/Washington Post Newswire Service (ltw_eng)</li><br>
<li>Washington Post/Bloomberg Newswire Service (wpb_eng)</li><br>
<li>New York Times Newswire Service (nyt_eng)</li><br>
<li>Xinhua News Agency, English Service (xin_eng)</li><br>
</ul><br>
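<p>The source identifiers in parentheses above can be kept in a small lookup table when working with the corpus. The following is a minimal sketch; the names <code>SOURCES</code> and <code>source_name</code> are hypothetical and are not part of the API distributed with the dataset.</p><br>

```python
# Lookup table built from the seven sources listed above.
# SOURCES and source_name are illustrative names, not part of the dataset's API.
SOURCES = {
    "afp_eng": "Agence France-Presse, English Service",
    "apw_eng": "Associated Press Worldstream, English Service",
    "cna_eng": "Central News Agency of Taiwan, English Service",
    "ltw_eng": "Los Angeles Times/Washington Post Newswire Service",
    "wpb_eng": "Washington Post/Bloomberg Newswire Service",
    "nyt_eng": "New York Times Newswire Service",
    "xin_eng": "Xinhua News Agency, English Service",
}


def source_name(code: str) -> str:
    """Return the full source name for an identifier such as 'nyt_eng'.

    The lookup is case-insensitive; an unknown identifier raises KeyError.
    """
    return SOURCES[code.lower()]
```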
<p>The following layers of annotation were added:</p><br>
<ul><br>
<li>Tokenized and segmented sentences</li><br>
<li>Treebank-style constituent parse trees</li><br>
<li>Syntactic dependency trees</li><br>
<li>Named entities</li><br>
<li>In-document coreference chains</li><br>
</ul><br>
<p>The annotation was performed in a three-step process: (1) the data was preprocessed and sentences were selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities and coreference chains. Over 183 million sentences were parsed.</p><br>
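<p>The sentence-selection rule in step (1) can be illustrated with a short filter. This is only a sketch under the assumption that each sentence is already a list of tokens; the tokenizer used for the corpus is not specified here, and the names <code>MAX_TOKENS</code> and <code>select_for_annotation</code> are hypothetical.</p><br>

```python
from typing import List

# Threshold taken from the description above: sentences with more than
# 100 tokens were excluded from annotation.
MAX_TOKENS = 100


def select_for_annotation(sentences: List[List[str]]) -> List[List[str]]:
    """Keep only sentences with at most MAX_TOKENS tokens.

    Each sentence is assumed to be a pre-tokenized list of strings.
    """
    return [tokens for tokens in sentences if len(tokens) <= MAX_TOKENS]
```

<p>A sentence of exactly 100 tokens is kept, while one of 101 tokens is dropped.</p><br>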
<p>The data is stored in a form similar to the Gigaword SGML format, with XML annotations containing the additional markup. The included API provides object representations for the contents of the XML files.</p><br>
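<p>To make the idea of an object representation concrete, the sketch below models the five annotation layers as plain Python data classes. Every class, field and function name here is hypothetical: it does not reproduce the included API or the XML schema, and the example document is invented for illustration.</p><br>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    index: int  # position of the token within its sentence
    word: str


@dataclass
class Dependency:
    relation: str   # e.g. "nsubj"
    governor: int   # token index of the head
    dependent: int  # token index of the dependent


@dataclass
class NamedEntity:
    label: str  # e.g. "PERSON"
    start: int  # first token index (inclusive)
    end: int    # last token index (exclusive)


@dataclass
class Sentence:
    tokens: List[Token]
    parse: str  # Treebank-style bracketed constituent tree
    dependencies: List[Dependency] = field(default_factory=list)
    entities: List[NamedEntity] = field(default_factory=list)


@dataclass
class Mention:
    sentence_index: int
    start: int  # first token index (inclusive)
    end: int    # last token index (exclusive)


@dataclass
class Document:
    doc_id: str
    sentences: List[Sentence]
    # Each chain is a list of mentions that refer to the same entity.
    coref_chains: List[List[Mention]] = field(default_factory=list)


def mention_text(doc: Document, mention: Mention) -> str:
    """Return the surface text of a coreference mention."""
    tokens = doc.sentences[mention.sentence_index].tokens
    return " ".join(t.word for t in tokens[mention.start:mention.end])


def example_document() -> Document:
    """Build a tiny invented document with all five layers filled in."""
    first = Sentence(
        tokens=[Token(0, "Alice"), Token(1, "slept"), Token(2, ".")],
        parse="(ROOT (S (NP (NNP Alice)) (VP (VBD slept)) (. .)))",
        dependencies=[Dependency("nsubj", governor=1, dependent=0)],
        entities=[NamedEntity("PERSON", 0, 1)],
    )
    second = Sentence(
        tokens=[Token(0, "She"), Token(1, "woke"), Token(2, ".")],
        parse="(ROOT (S (NP (PRP She)) (VP (VBD woke)) (. .)))",
        dependencies=[Dependency("nsubj", governor=1, dependent=0)],
    )
    chain = [Mention(0, 0, 1), Mention(1, 0, 1)]
    return Document("example_doc_0001", [first, second], [chain])
```

<p>With this layout, in-document coreference is resolved by following a chain's mentions back to their sentences, for example by calling <code>mention_text</code> on each mention of the first chain.</p><br>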
<h3>Samples</h3><br>
<p>Please follow the link for a <a href="desc/addenda/LDC2012T21.jpg" rel="nofollow">sample</a>.</p><br>
<h3>Additional Licensing Information</h3><br>
<p>Any 2011 member organization that licensed English Gigaword Fifth Edition (<a href="http://catalog.ldc.upenn.edu/LDC2011T07" rel="nofollow">LDC2011T07</a>) may request a no-cost copy of Annotated English Gigaword. Any non-member organization that licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for a $250 media fee. Please contact <a rel="nofollow">ldc@ldc.upenn.edu</a> for licensing or with any additional questions.</p><br>
<h3>Updates</h3><br>
<p>None at this time.</p><br>
Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009 Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York Times, © 2010 The Washington Post News Service with Bloomberg News, © 1995-2010 Xinhua News Agency, © 2012 Matthew R. Gormley, © 2003, 2005, 2007, 2009, 2011, 2012 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



