
Annotated English Gigaword

DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2012T21
Resource description:
<h3>Introduction</h3>
<p>Annotated English Gigaword was developed by <a href="http://hltcoe.jhu.edu/" rel="nofollow">Johns Hopkins University's Human Language Technology Center of Excellence</a>. It adds automatically generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (<a href="http://catalog.ldc.upenn.edu/LDC2011T07" rel="nofollow">LDC2011T07</a>) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics, enabling broader involvement by researchers in large-scale knowledge-acquisition efforts.</p>
<h3>Data</h3>
<p>Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition, drawn from seven news sources:</p>
<ul>
<li>Agence France-Presse, English Service (afp_eng)</li>
<li>Associated Press Worldstream, English Service (apw_eng)</li>
<li>Central News Agency of Taiwan, English Service (cna_eng)</li>
<li>Los Angeles Times/Washington Post Newswire Service (ltw_eng)</li>
<li>Washington Post/Bloomberg Newswire Service (wpb_eng)</li>
<li>New York Times Newswire Service (nyt_eng)</li>
<li>Xinhua News Agency, English Service (xin_eng)</li>
</ul>
<p>The following layers of annotation were added:</p>
<ul>
<li>Tokenized and segmented sentences</li>
<li>Treebank-style constituent parse trees</li>
<li>Syntactic dependency trees</li>
<li>Named entities</li>
<li>In-document coreference chains</li>
</ul>
<p>The annotation was performed in a three-step process: (1) the data was preprocessed and sentences were selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities and coreference chains. Over 183 million sentences were parsed.</p>
<p>The data is stored in a form similar to the Gigaword SGML format, with XML annotations containing the additional markup. The included API provides object representations for the contents of the XML files.</p>
<h3>Samples</h3>
<p>Please view this <a href="desc/addenda/LDC2012T21.jpg" rel="nofollow">sample</a>.</p>
<h3>Additional Licensing Information</h3>
<p>Any 2011 member organization that licensed English Gigaword Fifth Edition (<a href="http://catalog.ldc.upenn.edu/LDC2011T07" rel="nofollow">LDC2011T07</a>) may request a no-cost copy of Annotated English Gigaword. Any non-member organization that licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for a $250 media fee. Please contact <a rel="nofollow">ldc@ldc.upenn.edu</a> for licensing or with any additional questions.</p>
<h3>Updates</h3>
<p>None at this time.</p>
<p>Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009 Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York Times, © 2010 The Washington Post News Service with Bloomberg News, © 1995-2010 Xinhua News Agency, © 2012 Matthew R. Gormley, © 2003, 2005, 2007, 2009, 2011, 2012 Trustees of the University of Pennsylvania</p>
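The sentence-selection rule from step (1) of the annotation process, which excludes sentences with more than 100 tokens, can be illustrated with a minimal sketch. The function name `select_sentences`, the constant `MAX_TOKENS`, and the pre-tokenized list-of-strings input are assumptions made for illustration only; they are not part of the API distributed with the corpus, and the description does not show the actual preprocessing tools.

```python
from typing import Iterable, List

# Threshold taken from the description: sentences with more than
# 100 tokens were excluded from annotation.
MAX_TOKENS = 100


def select_sentences(sentences: Iterable[List[str]],
                     max_tokens: int = MAX_TOKENS) -> List[List[str]]:
    """Keep only sentences with at most `max_tokens` tokens.

    Each sentence is assumed to be already tokenized into a list of
    strings (a hypothetical input format, used for illustration only).
    """
    return [tokens for tokens in sentences if len(tokens) <= max_tokens]


if __name__ == "__main__":
    short = ["The", "market", "rose", "."]
    boundary = ["w"] * 100   # exactly 100 tokens: kept
    too_long = ["w"] * 101   # more than 100 tokens: excluded
    kept = select_sentences([short, boundary, too_long])
    print(len(kept))  # 2
```

A sentence of exactly 100 tokens is kept here because the description excludes only sentences with more than 100 tokens.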
Provider:
Linguistic Data Consortium
Created:
2020-11-30