
English-Arabic Treebank v 1.0

DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2006T10
Resource description:
<h3>Introduction</h3>
<p>English-Arabic Parallel Treebank v 1.0 was developed by the Linguistic Data Consortium (LDC) and consists of 52,238 words in 224 files, each containing an individual Agence France Presse (AFP) news story.</p>
<p>The AFP stories included here correspond to approximately the first 50K words of <a href="../../../LDC2005T02">Arabic Treebank: Part 1 v 3.0 (LDC2005T02)</a>. The English translation was provided by LDC and was part-of-speech tagged and treebanked for this project.</p>
<h3>Data</h3>
<p>The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:</p>
<ol>
<li>POS: hyphenated items are tokenized at the hyphen (for example, "New York-based" becomes "New York - based"), and the HYPH and AFX tags were added to support this change in tokenization.</li>
<li>Treebank: the node label NML was added for sub-NP nominal constituents, replacing NX and most NP-internal NAC.</li>
</ol>
<p>More detailed addenda to the Penn Treebank II guidelines, and a mapping from the original Arabic Treebank filenames to the filenames used in this release, can be found in the associated documentation.</p>
<h3>Samples</h3>
<p>For an example of the data in this corpus, please review this text <a href="desc/addenda/LDC2006T10.html" rel="nofollow">sample</a>.</p>
<h3>Updates</h3>
<p>None at this time.</p>
<p>Portions © 2000 Agence France Presse, © 2006 Trustees of the University of Pennsylvania</p>
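The hyphen tokenization described under Data can be illustrated with a minimal Python sketch. It is not the LDC's tooling: the function name `split_hyphenated` and the rule "split at every hyphen between two non-hyphen characters" are assumptions made for illustration, and the actual annotation guidelines in the release documentation may treat some hyphenated items differently. The sketch marks each separated hyphen as HYPH and leaves every other tag unset, since assigning the remaining tags (including AFX) requires the full guidelines.

```python
import re


def split_hyphenated(tokens):
    """Split tokens at internal hyphens, keeping each hyphen as its own token.

    Hypothetical helper: approximates the "New York-based" -> "New York - based"
    change described above, not the exact LDC tokenization rules.
    """
    out = []
    for tok in tokens:
        # Split only when a hyphen sits between two non-hyphen characters,
        # so bare "-" and "--" tokens pass through unchanged.
        if re.search(r"[^-]-[^-]", tok):
            out.extend(part for part in re.split(r"(-)", tok) if part)
        else:
            out.append(tok)
    return out


def mark_hyphens(tokens):
    """Pair each token with "HYPH" if it is a lone hyphen, else None (untagged)."""
    return [(tok, "HYPH" if tok == "-" else None) for tok in tokens]


if __name__ == "__main__":
    words = "a New York-based company".split()
    print(mark_hyphens(split_hyphenated(words)))
```

Keeping the hyphen as a separate token is what makes a dedicated tag necessary: once "York-based" becomes three tokens, the middle one needs a part-of-speech label of its own.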
Provider:
Linguistic Data Consortium
Created:
2020-11-30