Arabic Treebank: Part 3 v 1.0

Name: Arabic Treebank: Part 3 v 1.0
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:16:49
License: No description available

DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2004T11

Resource description:

Introduction

Arabic Treebank: Part 3 v 1.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T11, ISBN 1-58563-298-8.

This publication is the third part of a 1,000,000-word Arabic Treebank corpus, designed to support language research and the development of language technology for Modern Standard Arabic. Part one was released in 2003 as Arabic Treebank: Part 1 v 2.0 (http://catalog.ldc.upenn.edu/LDC2003T06), with source data extracted from Agence France-Presse stories. Part two was released in 2004 as Arabic Treebank: Part 2 v 2.0 (http://catalog.ldc.upenn.edu/LDC2004T02), with source data extracted from Al-Hayat, distributed by Ummah. The current Arabic Treebank: Part 3 v 1.0 corpus consists of stories from An Nahar News Agency.

Data

This corpus includes 600 stories from the An Nahar News Text, stored as one story per file. The 600 files contain a total of 340,281 words, counting non-Arabic tokens such as numbers and punctuation. New annotation features include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles.

The corpus contains 293,035 Arabic-only word tokens (counted before clitics are separated). The morphological parser provided an acceptable morphological analysis and POS tag for 290,842 of them (99.25%) and failed to analyze the remaining 2,193 (0.75%) correctly.

Samples

Please view the following samples:

sgm Sample: desc/addenda/LDC2004T11.sgm.txt
xml Sample: desc/addenda/LDC2004T11.xml
txt Sample: desc/addenda/LDC2004T11.xml.txt

Updates

There are no updates available at this time.

Portions © 2002 An Nahar, © 2003, 2004 Trustees of the University of Pennsylvania
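
As a quick sanity check on the figures quoted in the description, the following minimal sketch recomputes the parser coverage percentages and the average story length from the stated counts. The variable names and the `share` helper are illustrative assumptions, not part of the corpus or any LDC tooling.

```python
# Counts quoted in the corpus description (not read from the corpus itself).
total_words = 340_281          # all tokens, including numbers and punctuation
story_files = 600              # one story per file
arabic_tokens = 293_035        # Arabic-only tokens, before clitic separation
analyzed = 290_842             # tokens given an acceptable analysis and POS tag
failed = 2_193                 # tokens the morphological parser got wrong


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# The two parser outcomes should account for every Arabic-only token.
assert analyzed + failed == arabic_tokens

coverage = share(analyzed, arabic_tokens)
failure_rate = share(failed, arabic_tokens)
avg_words_per_story = total_words / story_files

print(f"parser coverage: {coverage}%")
print(f"parser failures: {failure_rate}%")
print(f"average words per story: {avg_words_per_story:.1f}")
```

The recomputed percentages agree with the 99.25% and 0.75% given in the description.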

Provider:

Linguistic Data Consortium

Created:

2020-11-30

Collected Summary

Dataset Introduction