
Arabic Treebank: Part 1 v 2.0

Source: DataCite Commons. Updated 2021-07-01. Indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2003T06
Resource description:
<h3>Introduction</h3> <p>Arabic Treebank: Part 1 v 2.0 was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2003T06 and its ISBN is 1-58563-261-9. This publication is part one of a one-million-word Arabic Treebank corpus, designed to support language research and the development of language technology for Modern Standard Arabic. </p><h3>Data</h3> <p>The Penn Arabic Treebank, which is part of the DARPA TIDES project, started in the fall of 2001 with the objective of producing human and computer annotations of a large machine-readable Arabic text corpus (for project background, please see <a href="http://www.ldc.upenn.edu/Projects/TIDES/Arabic/data/POS/POStest.html" rel="nofollow">POStest.html</a>). As in previous Penn Treebanks, two different kinds of information need to be produced by two different (human and computer) processes. The Arabic Treebank project therefore consists of two distinct phases: </p><ol> <li>Part-of-Speech (POS) tagging - divides the text into lexical tokens and gives relevant information about each token, such as lexical category, inflectional features, and a gloss </li> <li>Arabic Treebanking (ArabicTB) - characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc. </li> </ol><p>Both tasks started in November 2001 with an initial pilot consisting of 734 files representing roughly 166K words of written Modern Standard Arabic newswire from the Agence France-Presse corpus. </p><p>This publication aims to describe a written Modern Standard Arabic text corpus. The source data consists of Agence France-Presse (AFP) newswire from July through November 2000. This publication includes 734 stories representing 140,265 words (168,123 tokens after clitic segmentation in the Treebank). </p><h3>Updates</h3> <p>There are no updates available at this time. 
</p> <br> Portions © 2000 Agence France-Presse, © 2002 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30