OntoNotes Release 4.0
DataCite Commons · Updated 2021-07-01 · Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2011T03
Description:
<h3>Introduction</h3><br>
<p>OntoNotes Release 4.0, Linguistic Data Consortium (LDC) catalog number LDC2011T03 and ISBN 1-58563-574-X, was developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes Release 4.0 is supported by the Defense Advanced Research Projects Agency, GALE Program Contract No. HR0011-06-C-0022.</p><br>
<p>OntoNotes Release 4.0 contains the content of earlier releases -- <a href="http://catalog.ldc.upenn.edu/LDC2007T21" rel="nofollow">OntoNotes Release 1.0 LDC2007T21</a>, <a href="http://catalog.ldc.upenn.edu/LDC2008T04" rel="nofollow">OntoNotes Release 2.0 LDC2008T04</a> and <a href="http://catalog.ldc.upenn.edu/LDC2009T24" rel="nofollow">OntoNotes Release 3.0 LDC2009T24</a> -- and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese and newswire data in Arabic. This cumulative publication consists of 2.4 million words, distributed as follows: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.</p><br>
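<p>As a quick sanity check on the breakdown above, the per-genre figures can be summed by language and overall. This is only an illustrative sketch: the counts are the rounded values quoted in the description (in thousands of words), and the dictionary layout and function name are our own choices, not part of the release.</p><br>

```python
# Per-genre word counts, in thousands, exactly as quoted in the description.
# The (language, genre) key layout is an illustrative choice.
WORDS_K = {
    ("Arabic", "newswire"): 300,
    ("Chinese", "newswire"): 250,
    ("Chinese", "broadcast news"): 250,
    ("Chinese", "broadcast conversation"): 150,
    ("Chinese", "web text"): 150,
    ("English", "newswire"): 600,
    ("English", "broadcast news"): 200,
    ("English", "broadcast conversation"): 200,
    ("English", "web text"): 300,
}


def totals_by_language(counts):
    """Sum the per-genre counts for each language."""
    totals = {}
    for (language, _genre), thousands in counts.items():
        totals[language] = totals.get(language, 0) + thousands
    return totals


per_language = totals_by_language(WORDS_K)
grand_total_k = sum(WORDS_K.values())

print(per_language)   # {'Arabic': 300, 'Chinese': 800, 'English': 1300}
print(grand_total_k)  # 2400, i.e. the stated 2.4 million words
```

<p>The per-language subtotals (300k Arabic, 800k Chinese, 1,300k English) add up to the 2.4 million words stated for the cumulative release.</p><br>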
<p>The OntoNotes project builds on two time-tested resources, following the <a href="http://catalog.ldc.upenn.edu/LDC99T42" rel="nofollow">Penn Treebank</a> for syntax and the <a href="http://catalog.ldc.upenn.edu/LDC2004T14" rel="nofollow">Penn PropBank</a> for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic over five years.</p><br>
<h3>Data</h3><br>
<p>Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v4.0.sql.gz) with a Python API to provide convenient cross-layer access.</p><br>
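<p>Before loading the database, it can help to see which tables the dump defines. The sketch below does not use the Python API shipped with the release. It assumes only that ontonotes-v4.0.sql.gz is a gzip-compressed plain-text SQL dump containing <code>CREATE TABLE</code> statements, which we have not verified against the actual file. The function names and the path in the usage note are hypothetical.</p><br>

```python
import gzip
import re

# Matches the table name in a CREATE TABLE statement, allowing an optional
# IF NOT EXISTS clause and optional backtick or double-quote quoting.
CREATE_TABLE = re.compile(
    r'^\s*CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[`"]?(\w+)[`"]?',
    re.IGNORECASE,
)


def list_tables(lines):
    """Return table names found in an iterable of SQL text lines, in order."""
    tables = []
    for line in lines:
        match = CREATE_TABLE.match(line)
        if match:
            tables.append(match.group(1))
    return tables


def list_tables_in_dump(path):
    """Stream a gzip-compressed SQL dump and list the tables it creates."""
    # errors="replace" keeps the scan going if the dump is not clean UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return list_tables(handle)
```

<p>A call such as <code>list_tables_in_dump("ontonotes-v4.0.sql.gz")</code> streams the file line by line, so the dump never has to be decompressed to disk. Adjust the path to wherever the release was unpacked.</p><br>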
<h3>Tools</h3><br>
<p>This release includes OntoNotes DB Tool v0.999 beta, the tool used to assemble the database from the original annotation files. It can be found in the directory ontonotes-db-tool-v0.999b. This tool can be used to derive various views of the data from the database, and it provides an API that can implement new queries or views. Licensing information for the OntoNotes DB Tool package is included in its source directory.</p><br>
<h3>Updates</h3><br>
<p>On May 21, 2013, an update was issued to fix some bracketing errors in the following file: ontonotes-release-4.0/data/files/data/english/annotations/nw/wsj/05/wsj_0560.parse. All corpora ordered after this date include the update. Please contact <a href="ldc@ldc.upenn.edu" rel="nofollow">ldc@ldc.upenn.edu</a> for more information or to obtain the updated file.</p><br>
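<p>Users holding an older copy may want a rough check of their parse files. The sketch below assumes that .parse files use Penn Treebank-style parenthesized trees, with literal parentheses in the text escaped as tokens such as -LRB- and -RRB-. It detects only unbalanced parentheses, so it cannot tell whether a file contains the specific errors addressed by the update, nor can it catch bracketing that is balanced but linguistically wrong. The function name is hypothetical.</p><br>

```python
def paren_balance_errors(text):
    """Return (line_number, message) pairs for unbalanced parentheses.

    An unmatched ')' is reported on the line where it occurs. Any '('
    still open at the end of the text is reported once, with line None.
    """
    errors = []
    depth = 0
    for line_number, line in enumerate(text.splitlines(), start=1):
        for char in line:
            if char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
                if depth < 0:
                    errors.append((line_number, "unmatched ')'"))
                    depth = 0  # reset so later problems are still reported
    if depth > 0:
        errors.append((None, f"{depth} unclosed '('"))
    return errors


tree = "(TOP (S (NP (DT The) (NN cat)) (VP (VBZ sleeps))))"
print(paren_balance_errors(tree))  # []
```

<p>An empty list means every opening parenthesis has a matching close. It does not establish that the trees are correct.</p><br>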
<h3>Sponsorship</h3><br>
<p>This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.</p><br>
<h3>Samples</h3><br>
<ul><br>
<li><a href="desc/addenda/LDC2009T24_arb.jpg" rel="nofollow">Arabic</a></li><br>
<li><a href="desc/addenda/LDC2009T24_chn.jpg" rel="nofollow">Chinese</a></li><br>
<li><a href="desc/addenda/LDC2009T24_eng.jpg" rel="nofollow">English</a></li><br>
</ul><br>
Portions © 2006 Abu Dhabi TV, © 2006 Agence France Presse, © 2006 Al-Ahram, © 2006 Al Alam News Channel, © 2006 Al Arabiya, © 2006 Al Hayat, © 2006 Al Iraqiyah, © 2006 Al Quds-Al Arabi, © 2006 Anhui TV, © 2002, 2006 An Nahar, © 2006 Asharq-al-Awsat, © 2005 Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, © 2000-2001, 2005-2006 China Central TV, © 2006 China Military Online, © 2000-2001 China National Radio, © 2006 Chinanews.com, © 2000-2001 China Television System, © 1989 Dow Jones & Company, Inc., © 2006 Dubai TV, © 2006 Guangming Daily, © 2006 Kuwait TV, © 2005-2006 National Broadcasting Company, Inc., © 2006 New Tang Dynasty TV, © 2006 Nile TV, © 2006 Oman TV, © 2006 PAC Ltd, © 2006 People's Daily Online, © 2005-2006 Phoenix TV, © 2000-2001 Sinorama Magazine, © 2006 Syria TV, © 1996-1998, 2006 Xinhua News Agency, © 2007, 2008, 2009, 2011 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30
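Summary
Dataset Introduction

Background and Challenges
Background Overview
OntoNotes Release 4.0 is a multilingual (English, Chinese, Arabic) and multi-genre (news, broadcast, web text, and others) corpus containing 2.4 million words of annotated data. It supports applications such as information extraction and retrieval.
The summary above was collected and generated by 遇见数据集.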



