
Arabic Treebank: Part 2 v 2.0

Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2004T02
Resource description:
<h3>Introduction</h3><br> <p>Arabic Treebank: Part 2 v 2.0 was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2004T02 and its ISBN is 1-58563-282-1.</p><br> <p>This publication is the second part of a 1,000,000-word Arabic Treebank corpus, designed to support language research and the development of language technology for Modern Standard Arabic. Part one was released in 2003 as <a href="http://catalog.ldc.upenn.edu/LDC2003T06" rel="nofollow">Arabic Treebank: Part 1 v 2.0</a>, with source data drawn from Agence France-Presse stories. The current Arabic Treebank: Part 2 v 2.0 corpus consists of stories from Al-Hayat distributed by Ummah.</p><br> <h3>Data</h3><br> <p>This corpus includes 501 stories from the Ummah Arabic News Text, one story per file. The 501 files contain a total of 144,199 words, counting non-Arabic tokens such as numbers and punctuation. New annotation features include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.</p><br> <p>The corpus contains 125,698 Arabic-only word tokens (counted before clitics are separated). Of these, the morphological parser provided an acceptable morphological analysis and POS tag for 124,740 (99.24%) and failed to analyze the remaining 958 (0.76%) correctly.</p><br> <h3>Samples</h3><br> <p>Please view the following samples:</p><br> <ul><br> <li><a href="desc/addenda/LDC2004T02.sgm">SGML</a></li><br> <li><a href="desc/addenda/LDC2004T02.tree">Treebank</a></li><br> <li><a href="desc/addenda/LDC2004T02.tbnk.xml">Treebank - XML</a></li><br> <li><a href="desc/addenda/LDC2004T02.pos.xml">POS</a></li><br> </ul><br> <h3>Updates</h3><br> <p>There are no updates available at this time.</p><br> Portions © 2001-2002 Ummah Press, © 2004 Trustees of the University of Pennsylvania
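The token counts quoted in the Data section can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, the numbers are copied from the description, and nothing here reads the corpus files themselves.

```python
# Counts copied from the corpus description; all names here are illustrative.
TOTAL_TOKENS = 144_199    # all words, including numbers and punctuation
ARABIC_TOKENS = 125_698   # Arabic-only word tokens, before clitic separation
ANALYZED = 124_740        # tokens given an acceptable analysis and POS tag
FAILED = 958              # tokens the morphological parser did not analyze correctly


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimal places."""
    return round(100 * part / whole, 2)


# The analyzed and failed tokens should account for every Arabic-only token.
assert ANALYZED + FAILED == ARABIC_TOKENS

coverage = percent(ANALYZED, ARABIC_TOKENS)      # share analyzed acceptably
failure = percent(FAILED, ARABIC_TOKENS)         # share the parser missed
non_arabic = TOTAL_TOKENS - ARABIC_TOKENS        # numbers, punctuation, etc.

print(f"coverage: {coverage}%  failure: {failure}%  non-Arabic tokens: {non_arabic}")
```

The difference between the two totals (18,501) is derived here, not stated in the description. It is the number of tokens that are not Arabic words.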

Provider:
Linguistic Data Consortium
Created:
2020-11-30