Arabic Treebank: Part 2 v 2.0

Name: Arabic Treebank: Part 2 v 2.0
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:16:37
License: No description available

Source: DataCite Commons (updated 2021-07-01, indexed 2025-04-16)

Download link:

https://catalog.ldc.upenn.edu/LDC2004T02

Description:

<h3>Introduction</h3><br> <p>Arabic Treebank: Part 2 v 2.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T02, ISBN 1-58563-282-1.</p><br> <p>This publication is the second part of a 1,000,000-word Arabic Treebank corpus designed to support language research and the development of language technology for Modern Standard Arabic. Part one was released in 2003 as <a href="http://catalog.ldc.upenn.edu/LDC2003T06" rel="nofollow">Arabic Treebank: Part 1 v 2.0</a>, with source data drawn from Agence France-Presse stories. The current Arabic Treebank: Part 2 v 2.0 corpus consists of stories from Al-Hayat distributed by Ummah.</p><br> <h3>Data</h3><br> <p>This corpus includes 501 stories from the Ummah Arabic News Text, one story per file. The 501 files contain a total of 144,199 words, counting non-Arabic tokens such as numbers and punctuation. New annotation features include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.</p><br> <p>The corpus contains 125,698 Arabic-only word tokens (counted before clitics are separated). Of these, the morphological parser provided an acceptable morphological analysis and POS tag for 124,740 (99.24%) and failed to analyze the remaining 958 (0.76%) correctly.</p><br> <h3>Samples</h3><br> <p>Please view the following samples:</p><br> <ul><br> <li><a href="desc/addenda/LDC2004T02.sgm">SGML</a></li><br> <li><a href="desc/addenda/LDC2004T02.tree">Treebank</a></li><br> <li><a href="desc/addenda/LDC2004T02.tbnk.xml">Treebank - XML</a></li><br> <li><a href="desc/addenda/LDC2004T02.pos.xml">POS</a></li><br> </ul><br> <h3>Updates</h3><br> <p>There are no updates available at this time.</p><br> Portions © 2001-2002 Ummah Press, © 2004 Trustees of the University of Pennsylvania

Provider:

Linguistic Data Consortium

Created:

2020-11-30
