English-Arabic Treebank v 1.0

Name: English-Arabic Treebank v 1.0
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:18:57
License: No description available

DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2006T10

Description:

<h3>Introduction</h3>
<p>English-Arabic Parallel Treebank v 1.0 was developed by the Linguistic Data Consortium (LDC) and consists of 52,238 words in 224 files of individual Agence France Presse (AFP) news stories.</p>
<p>The AFP stories included here correspond to approximately the first 50K words of <a href="../../../LDC2005T02">Arabic Treebank: Part 1 v 3.0 (LDC2005T02)</a>. The English translation was provided by LDC, and was part-of-speech tagged and treebanked for this project.</p>
<h3>Data</h3>
<p>The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:</p>
<ol>
<li>POS: tokenization of hyphenated items ("New York-based" has been replaced by "New York - based" for example), and the addition of HYPH and AFX tags necessitated by this change in tokenization</li>
<li>TreeBank: the addition of the node label NML for sub-NP nominal constituents (replacing NX and most NP-internal NAC)</li>
</ol>
<p>More detailed addenda to the Penn Treebank II guidelines and a mapping from the original Arabic Treebank filenames to the current filenames used in this release can be found in the associated documentation.</p>
<h3>Samples</h3>
<p>For an example of the data in this corpus, please review this text <a href="desc/addenda/LDC2006T10.html" rel="nofollow">sample</a>.</p>
<h3>Updates</h3>
<p>None at this time.</p>
<p>Portions © 2000 Agence France Presse, © 2006 Trustees of the University of Pennsylvania</p>
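The hyphen tokenization described above ("New York-based" becoming "New York - based") can be illustrated with a minimal sketch. This is not LDC's tooling: the function name `split_hyphens` and the regular expression are assumptions made for illustration, and the sketch only separates word-internal hyphens into their own tokens. It does not attempt the HYPH or AFX tagging, which is governed by the corpus documentation.

```python
import re

# Hypothetical helper, not part of the LDC release: matches a hyphen that has
# a word character on both sides (e.g. the hyphen in "York-based").
_WORD_INTERNAL_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphens(text):
    """Split whitespace-separated tokens so each word-internal hyphen
    becomes a token of its own."""
    tokens = []
    for tok in text.split():
        parts = _WORD_INTERNAL_HYPHEN.split(tok)
        for i, part in enumerate(parts):
            if i > 0:
                tokens.append("-")  # re-insert the hyphen as a separate token
            tokens.append(part)
    return tokens


print(" ".join(split_hyphens("New York-based")))  # prints: New York - based
```

A real pipeline would likely need exceptions (for example numeric ranges or hyphens inside URLs); the release's addenda to the Penn Treebank II guidelines are the authority on those cases.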

Provider:

Linguistic Data Consortium

Created:

2020-11-30
