
OntoNotes Release 5.0

Source: DataCite Commons; updated 2024-12-29; indexed 2024-07-13
Download link:
https://catalog.ldc.upenn.edu/LDC2013T19
Description:
<h3>Introduction</h3>
<p>OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between <a href="http://www.bbn.com/" rel="nofollow">BBN Technologies</a>, the <a href="http://www.colorado.edu/" rel="nofollow">University of Colorado</a>, the <a href="http://www.upenn.edu/" rel="nofollow">University of Pennsylvania</a> and the <a href="http://www.isi.edu/home" rel="nofollow">University of Southern California's Information Sciences Institute</a>. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology, and coreference).</p>
<p>OntoNotes Release 5.0 contains the content of earlier releases -- OntoNotes Release 1.0 <a href="http://catalog.ldc.upenn.edu/LDC2007T21" rel="nofollow">LDC2007T21</a>, OntoNotes Release 2.0 <a href="http://catalog.ldc.upenn.edu/LDC2008T04" rel="nofollow">LDC2008T04</a>, OntoNotes Release 3.0 <a href="http://catalog.ldc.upenn.edu/LDC2009T24" rel="nofollow">LDC2009T24</a> and OntoNotes Release 4.0 <a href="http://catalog.ldc.upenn.edu/LDC2011T03" rel="nofollow">LDC2011T03</a> -- and adds source data from, and/or additional annotations for, newswire (News), broadcast news (BN), broadcast conversation (BC), telephone conversation (Tele) and web data (Web) in English and Chinese, and newswire data in Arabic. Also contained is English pivot text (Old Testament and New Testament text). This cumulative publication consists of 2.9 million words, with counts shown in the table below.</p>
<table>
<tbody>
<tr><td>&nbsp;</td><td>Arabic</td><td>English</td><td>Chinese</td></tr>
<tr><td>News</td><td>300k</td><td>625k</td><td>250k</td></tr>
<tr><td>BN</td><td>n/a</td><td>200k</td><td>250k</td></tr>
<tr><td>BC</td><td>n/a</td><td>200k</td><td>150k</td></tr>
<tr><td>Web</td><td>n/a</td><td>300k</td><td>150k</td></tr>
<tr><td>Tele</td><td>n/a</td><td>120k</td><td>100k</td></tr>
<tr><td>Pivot</td><td>n/a</td><td>n/a</td><td>300k</td></tr>
</tbody>
</table>
<p>The OntoNotes project built on two time-tested resources, following the <a href="http://catalog.ldc.upenn.edu/LDC99T42" rel="nofollow">Penn Treebank</a> for syntax and the <a href="http://catalog.ldc.upenn.edu/LDC2004T14" rel="nofollow">Penn PropBank</a> for predicate-argument structure. Its semantic representation includes word sense disambiguation for nouns and verbs, with some word senses connected to an ontology, and coreference.</p>
<h3>Data</h3>
<p>Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v5.0.sql.gz) with a Python API to provide convenient cross-layer access.</p>
<p>It is a known issue that this release contains some non-validating XML files. The included tools, however, use a non-validating XML parser to parse the .xml files and load the appropriate values.</p>
<h3>Tools</h3>
<p>This release includes OntoNotes DB Tool v0.999 beta, the tool used to assemble the database from the original annotation files. It can be found in the directory tools/ontonotes-db-tool-v0.999b. This tool can be used to derive various views of the data from the database, and it provides an API for implementing new queries or views. Licensing information for the OntoNotes DB Tool package is included in its source directory.</p>
<h3>Samples</h3>
<p>Please view these samples:</p>
<ul>
<li><a href="desc/addenda/LDC2013T19.cmn.jpg" rel="nofollow">Chinese</a></li>
<li><a href="desc/addenda/LDC2013T19.ara.jpg" rel="nofollow">Arabic</a></li>
<li><a href="desc/addenda/LDC2013T19.eng.jpg" rel="nofollow">English</a></li>
</ul>
<h3>Updates</h3>
<p>Additional documentation was added on December 11, 2014 and is included in downloads after that date.</p>
<h3>Acknowledgment</h3>
<p>This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.</p>
<p>Portions © 2006 Abu Dhabi TV, © 2006 Agence France Presse, © 2006 Al-Ahram, © 2006 Al Alam News Channel, © 2006 Al Arabiya, © 2006 Al Hayat, © 2006 Al Iraqiyah, © 2006 Al Quds-Al Arabi, © 2006 Anhui TV, © 2002, 2006 An Nahar, © 2006 Asharq-al-Awsat, © 2010 Bible League International, © 2005 Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, © 2000-2001, 2005-2006 China Central TV, © 2006 China Military Online, © 2000-2001 China National Radio, © 2006 Chinanews.com, © 2000-2001 China Television System, © 1989 Dow Jones & Company, Inc., © 2006 Dubai TV, © 2006 Guangming Daily, © 2006 Kuwait TV, © 2005-2006 National Broadcasting Company, Inc., © 2006 New Tang Dynasty TV, © 2006 Nile TV, © 2006 Oman TV, © 2006 PAC Ltd, © 2006 People's Daily Online, © 2005-2006 Phoenix TV, © 2000-2001 Sinorama Magazine, © 2006 Syria TV, © 1996-1998, 2006 Xinhua News Agency, © 1996, 1997, 2005, 2007, 2008, 2009, 2011, 2013 Trustees of the University of Pennsylvania</p>
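The Data section's remark about non-validating XML can be illustrated with a minimal Python sketch. It relies on the standard library's xml.etree.ElementTree, which checks well-formedness only and does not validate against a DTD. The element names, the DTD, and the helper load_children below are hypothetical and are not taken from the release or its tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical document (not from the release): the DTD allows only <sense>
# children inside <inventory>, yet the body also contains an undeclared
# <note> element, so a validating parser would reject it.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE inventory [
  <!ELEMENT inventory (sense*)>
  <!ELEMENT sense (#PCDATA)>
]>
<inventory>
  <sense>first sense</sense>
  <note>not declared in the DTD</note>
</inventory>
"""


def load_children(xml_text):
    """Parse well-formed XML without DTD validation; return (tag, text) pairs."""
    root = ET.fromstring(xml_text)
    return [(child.tag, (child.text or "").strip()) for child in root]


children = load_children(SAMPLE)
print(children)
```

The parse succeeds even though the content does not conform to its DTD, which is the behaviour the description attributes to the included tools. A file that is not well-formed (for example, one with an unclosed tag) would still raise xml.etree.ElementTree.ParseError.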

Provider:
Linguistic Data Consortium
Created:
2020-11-30
Collected summary
Dataset introduction
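Background and challenges
Background overview
OntoNotes Release 5.0 is the final release of the OntoNotes project. It is a multilingual text corpus covering English, Chinese, and Arabic, with source data that includes telephone conversations, news, and weblogs. The dataset is annotated with structural information (syntax and predicate-argument structure) and shallow semantics (word sense disambiguation and coreference), totals about 2.9 million words, and is used mainly for information extraction and information retrieval applications.
The summary above was collected and generated by 遇见数据集.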
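The total of roughly 2.9 million words quoted in the summary can be cross-checked against the per-genre table in the description. The sketch below transcribes that table in thousands of words. It assumes that the Pivot cell denotes 300k words like the other cells; the names COUNTS_K and total_words are illustrative only.

```python
# Word counts in thousands, transcribed from the description's table.
# None marks an "n/a" cell.
# Assumption: the Pivot cell is 300k words, in the same unit as the others.
COUNTS_K = {
    "News":  {"Arabic": 300,  "English": 625,  "Chinese": 250},
    "BN":    {"Arabic": None, "English": 200,  "Chinese": 250},
    "BC":    {"Arabic": None, "English": 200,  "Chinese": 150},
    "Web":   {"Arabic": None, "English": 300,  "Chinese": 150},
    "Tele":  {"Arabic": None, "English": 120,  "Chinese": 100},
    "Pivot": {"Arabic": None, "English": None, "Chinese": 300},
}


def total_words(counts_k):
    """Sum every non-empty cell and convert thousands to words."""
    return 1000 * sum(
        value
        for row in counts_k.values()
        for value in row.values()
        if value is not None
    )


total = total_words(COUNTS_K)
print(f"{total:,} words")
```

Under that assumption the cells add up to 2,945,000 words, which is consistent with the stated figure of about 2.9 million.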