
Arabic Treebank: Part 3 v 3.2

Source: DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2010T08
Resource description:
<h3>Introduction</h3><br> <p>Arabic Treebank: Part 3 (ATB3) v 3.2 was developed at the Linguistic Data Consortium (LDC). It consists of 599 distinct newswire stories from the Lebanese publication An Nahar with part-of-speech (POS), morphology, gloss and syntactic treebank annotation in accordance with the Penn Arabic Treebank (PATB) Guidelines developed in 2008 and 2009. This release represents a significant revision of LDC's previous ATB3 publications: <a href="http://catalog.ldc.upenn.edu/LDC2004T11" rel="nofollow">Arabic Treebank: Part 3 v 1.0 LDC2004T11</a> and <a href="http://catalog.ldc.upenn.edu/LDC2005T20" rel="nofollow">Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) LDC2005T20</a>.</p><br> <p>The ongoing PATB project supports research in Arabic-language natural language processing and human language technology development. The methodology and work leading to the release of this publication are described in detail in the documentation accompanying this corpus and in two research papers, <a href="https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2008-enhancing-arabic-treebank.pdf">Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines</a> and <a href="https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2010-consistent-flexible-integration.pdf">Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank</a>.</p><br> <h3>Data</h3><br> <p>ATB3 v 3.2 contains a total of 339,710 tokens before clitics are split, and 402,291 tokens after clitics are separated for the treebank annotation. This release includes all files that were previously made available to the DARPA GALE program community (Arabic Treebank Part 3 - Version 3.1, LDC2008E22). A number of inconsistencies in the 3.1 release data have been corrected here. These include changes to certain POS tags, along with the resulting tree changes. As a result, additional clitics have been separated, and some previously incorrectly split tokens have now been merged.</p><br> <p>One file from ATB3 v 2.0, ANN20020715.0063, has been removed from this corpus because its text is an exact duplicate of another file in this release (ANN20020715.0018). This reduces the number of files from 600 in ATB3 v 2.0 to 599 in ATB3 v 3.2.</p><br> <h3>Sponsorship</h3><br> <p>This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p><br> <h3>Sample</h3><br> <p>The included data are available in many different formats and files, as described in detail in the corpus documentation. The following is a screenshot excerpt taken from one of the new integrated data files: <a href="desc/addenda/LDC2010T08.jpg" rel="nofollow">sample</a>.</p><br> Portions © 2002 An Nahar, © 2003, 2004, 2005, 2007, 2008, 2009, 2010 Trustees of the University of Pennsylvania
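
The two token counts quoted in the Data section give a rough sense of how much clitic separation expands the text. The sketch below is only illustrative arithmetic on those published figures; the function and constant names are hypothetical and are not part of the corpus or any LDC tooling.

```python
# Token counts quoted in the ATB3 v 3.2 description.
TOKENS_BEFORE_SPLIT = 339_710  # source tokens, clitics still attached
TOKENS_AFTER_SPLIT = 402_291   # treebank tokens, clitics separated


def clitic_split_stats(before: int, after: int) -> tuple[int, float]:
    """Return (net extra tokens created by splitting, after/before ratio)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after - before, after / before


extra, ratio = clitic_split_stats(TOKENS_BEFORE_SPLIT, TOKENS_AFTER_SPLIT)
print(f"net extra tokens from clitic splitting: {extra}")
print(f"treebank tokens per source token: {ratio:.3f}")
```

On these figures, splitting adds a net 62,581 tokens, or roughly 1.18 treebank tokens per source token.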

Provider:
Linguistic Data Consortium
Created:
2020-11-30