English-Arabic Treebank v 1.0
Source: DataCite Commons. Updated: 2021-07-01. Indexed: 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2006T10
Resource description:
<h3>Introduction</h3><br>
<p>English-Arabic Parallel Treebank v 1.0 was developed by the Linguistic Data Consortium (LDC) and consists of 52,238 words in 224 files of individual Agence France Presse (AFP) news stories.</p><br>
<p>The AFP stories included here correspond to approximately the first 50K words of <a href="../../../LDC2005T02">Arabic Treebank: Part 1 v 3.0 (LDC2005T02)</a>. The English translation was provided by LDC, and was part-of-speech tagged and treebanked for this project.</p><br>
<h3>Data</h3><br>
<p>The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:</p><br>
<ol><br>
<li>POS: hyphenated items are split at the hyphen (for example, "New York-based" becomes "New York - based"), and the tags HYPH and AFX were added to accommodate this change in tokenization.</li><br>
<li>Treebank: the node label NML was added for sub-NP nominal constituents; it replaces NX and most NP-internal NAC.</li><br>
</ol><br>
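<p>As a rough illustration of the tokenization change in item 1, the sketch below splits whitespace-separated tokens at word-internal hyphens. It is not LDC tooling: the function names <code>split_hyphenated</code> and <code>retokenize</code> are hypothetical, the splitting rule (a hyphen between two word characters) is an assumption, and tagging (HYPH, AFX) is not modeled. The documentation addenda define the actual conventions.</p><br>

```python
import re

# Assumption: a hyphen is split only when it sits between two word
# characters. The real guidelines may list exceptions not modeled here.
_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphenated(token):
    """Split one token at word-internal hyphens, keeping each hyphen
    as its own token. Tokens without such hyphens are returned unchanged."""
    parts = []
    start = 0
    for match in _HYPHEN.finditer(token):
        parts.append(token[start:match.start()])  # text before the hyphen
        parts.append("-")                         # the hyphen itself
        start = match.end()
    parts.append(token[start:])                   # remaining text
    return parts


def retokenize(text):
    """Whitespace-tokenize text, then split hyphenated tokens."""
    tokens = []
    for token in text.split():
        tokens.extend(split_hyphenated(token))
    return tokens


print(" ".join(retokenize("New York-based")))  # prints: New York - based
```

<p>Under these assumptions a stand-alone hyphen or a trailing hyphen (as in "well-") is left untouched, since no word character appears on both sides.</p><br>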
<p>More detailed addenda to the Penn Treebank II guidelines and a mapping from the original Arabic Treebank filenames to the current filenames used in this release can be found in the associated documentation.</p><br>
<h3>Samples</h3><br>
<p>For an example of the data in this corpus, please review this text <a href="desc/addenda/LDC2006T10.html" rel="nofollow">sample</a>.</p><br>
<h3>Updates</h3><br>
<p>None at this time.</p><br>
Portions © 2000 Agence France Presse, © 2006 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



