Arabic Treebank: Part 3 v 1.0
Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2004T11
Resource description:
<h3>Introduction</h3><br>
<p>Arabic Treebank: Part 3 v 1.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T11 and ISBN 1-58563-298-8.</p><br>
<p>This publication is the third part of the 1,000,000-word Arabic Treebank corpus, designed to support language research and the development of language technology for Modern Standard Arabic. Part one was released in 2003 as <a href="http://catalog.ldc.upenn.edu/LDC2003T06" rel="nofollow">Arabic Treebank: Part 1 v 2.0</a>, with source data extracted from Agence France Presse stories. Part two was released in 2004 as <a href="http://catalog.ldc.upenn.edu/LDC2004T02" rel="nofollow">Arabic Treebank: Part 2 v 2.0</a>, with source data extracted from Al-Hayat as distributed by Ummah. The current Arabic Treebank: Part 3 v 1.0 corpus consists of stories from An Nahar News Agency.</p><br>
<h3>Data</h3><br>
<p>This corpus includes 600 stories from the An Nahar News Text. The 600 files, one story per file, contain a total of 340,281 words (counting non-Arabic tokens such as numbers and punctuation). New annotation features include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.</p><br>
<p>The corpus contains 293,035 Arabic-only word tokens (prior to the separation of clitics), of which 290,842 (99.25%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 2,193 (0.75%) were items that the morphological parser failed to analyze correctly.</p><br>
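<p>The figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the LDC release: the constant names and the <code>pct</code> helper are assumptions introduced here for illustration, and the per-story average is derived from the quoted totals rather than stated in the original description.</p><br>

```python
# Token counts as quoted in the Data section (names are illustrative only).
TOTAL_TOKENS = 340_281    # all tokens, including numbers and punctuation
ARABIC_TOKENS = 293_035   # Arabic-only word tokens, before clitic separation
ANALYZED = 290_842        # tokens given an acceptable analysis and POS tag
FAILED = 2_193            # tokens the morphological parser failed to analyze
STORIES = 600             # one story per file


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# The analyzed and failed tokens should account for every Arabic-only token.
assert ANALYZED + FAILED == ARABIC_TOKENS

print(pct(ANALYZED, ARABIC_TOKENS))       # share analyzed successfully
print(pct(FAILED, ARABIC_TOKENS))         # share the parser failed on
print(TOTAL_TOKENS - ARABIC_TOKENS)       # non-Arabic tokens (derived)
print(round(TOTAL_TOKENS / STORIES, 1))   # mean tokens per story (derived)
```

<p>The two percentages reproduce the 99.25% and 0.75% given in the description; the last two values are derived here and do not appear in the source.</p><br>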
<h3>Samples</h3><br>
<p>Please view the following samples:</p><br>
<ul><br>
<li><a href="desc/addenda/LDC2004T11.sgm.txt">sgm Sample</a></li><br>
<li><a href="desc/addenda/LDC2004T11.xml">xml Sample</a></li><br>
<li><a href="desc/addenda/LDC2004T11.xml.txt">txt Sample</a></li><br>
</ul><br>
<h3>Updates</h3><br>
<p>There are no updates available at this time.</p><br>
Portions © 2002 An Nahar, © 2003, 2004 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30
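Collected summary
Dataset introduction

Background and challenges
Background overview
This dataset is Part 3 v 1.0 of the Arabic Treebank project. It contains roughly 300,000 Arabic word tokens with detailed part-of-speech annotation, including vocalization and lemma IDs, drawn from 600 news stories from An Nahar News Agency. It is intended to support applications such as natural language processing and information retrieval, and serves as a language resource for research on Modern Standard Arabic.
The summary above was collected and generated by 遇见数据集.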



