
Prague Arabic Dependency Treebank 1.0

DataCite Commons · Updated 2021-07-01 · Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2004T23
Description:
<h3>Introduction</h3>
<p>The <a href="https://ufal.mff.cuni.cz/padt/PADT_1.0/docs/index.html">Prague Arabic Dependency Treebank</a> (PADT) consists of multi-level linguistic annotations of Modern Standard Arabic and also provides a variety of software implementations designed for general use in Natural Language Processing (NLP).</p>
<p>The PADT project is an open-ended activity of the <a href="http://ckl.mff.cuni.cz/" rel="nofollow">Center for Computational Linguistics</a>, the <a href="http://ufal.mff.cuni.cz/" rel="nofollow">Institute of Formal and Applied Linguistics</a>, and the <a href="http://enlil.ff.cuni.cz/" rel="nofollow">Institute of Comparative Linguistics</a> at <a href="http://www.cuni.cz/" rel="nofollow">Charles University in Prague</a>. It consists in the multi-level annotation of Arabic language resources in light of the theory of Functional Generative Description. The project is a younger sibling of the <a href="http://ufal.mff.cuni.cz/pdt/" rel="nofollow">Prague Dependency Treebank</a> for Czech and is maintained in cooperation with the <a href="http://www.ldc.upenn.edu/" rel="nofollow">Linguistic Data Consortium</a>, <a href="http://www.upenn.edu/" rel="nofollow">University of Pennsylvania</a>, which releases non-annotated corpora of Arabic newswire and develops an independent <a href="http://www.ircs.upenn.edu/arabic/" rel="nofollow">Penn Arabic Treebank</a>.</p>
<h3>Data</h3>
<p>The corpus of PADT 1.0 consists of morphologically and analytically annotated newswire texts of Modern Standard Arabic, which originate from the <a href="http://catalog.ldc.upenn.edu/LDC2003T12" rel="nofollow">Arabic Gigaword</a> and the plain data of <a href="http://catalog.ldc.upenn.edu/LDC2003T06" rel="nofollow">Penn Arabic Treebank, Part 1</a> and <a href="http://catalog.ldc.upenn.edu/LDC2004T02" rel="nofollow">Penn Arabic Treebank, Part 2</a>.</p>
<p>The PADT 1.0 distribution comprises over <strong>113,500 tokens</strong> of data annotated analytically and provided with disambiguated morphological information. In addition, the release includes complete MorphoTrees annotations covering more than <strong>148,000 tokens</strong>, 49,000 of which have also received analytical processing. The contents are further divided into data sets as indicated in the Table.</p>
<table>
<tbody>
<tr>
<th>Data Set</th>
<th>Tokens [A]</th>
<th>Tokens [M]</th>
<th>Tokens/Para</th>
<th>Tokens/Doc</th>
<th>Original Data Provider</th>
<th>News Period</th>
<th>Related Corpora</th>
</tr>
<tr>
<td><a href="data/AFP/" rel="nofollow">AFP</a></td>
<td>13,000</td>
<td>N/A</td>
<td>34.6 [N/A]</td>
<td>260 [N/A]</td>
<td>Agence France Presse</td>
<td>July 2000</td>
<td>Penn ATB Part 1</td>
</tr>
<tr>
<td><a href="data/UMH/" rel="nofollow">UMH</a></td>
<td>38,500</td>
<td>N/A</td>
<td>43.6 [N/A]</td>
<td>290 [N/A]</td>
<td>Ummah Press Service</td>
<td>Spring 2002</td>
<td>Penn ATB Part 2</td>
</tr>
<tr>
<td><a href="data/XIN/" rel="nofollow">XIN</a></td>
<td>13,500</td>
<td>N/A</td>
<td>31.2 [N/A]</td>
<td>155 [N/A]</td>
<td>Xinhua News Agency</td>
<td>May 2003</td>
<td>Arabic Gigaword</td>
</tr>
<tr>
<td><a href="data/ALH/" rel="nofollow">ALH</a></td>
<td>10,000</td>
<td>73,500</td>
<td>47.0 [47.8]</td>
<td>405 [405]</td>
<td>Al Hayat News Agency</td>
<td>September 2001</td>
<td>Arabic Gigaword</td>
</tr>
<tr>
<td><a href="data/ANN/" rel="nofollow">ANN</a></td>
<td>12,500</td>
<td>25,500</td>
<td>60.3 [50.3]</td>
<td>740 [630]</td>
<td>An Nahar News Agency</td>
<td>November 2002</td>
<td>Arabic Gigaword</td>
</tr>
<tr>
<td><a href="data/XIA/" rel="nofollow">XIA</a></td>
<td>26,500</td>
<td>49,500</td>
<td>29.7 [25.9]</td>
<td>235 [205]</td>
<td>Xinhua News Agency</td>
<td>May 2003</td>
<td>Arabic Gigaword</td>
</tr>
</tbody>
</table>
<p>In the Table, the token counts give the number of syntactic units annotated analytically [A] and within MorphoTrees [M]. The next two columns give approximate ratios of tokens per paragraph and tokens per document for the analytical annotation, with the corresponding MorphoTrees figure in square brackets. The selected documents may cover only a few days of the specified period.</p>
<h3>Samples</h3>
<p><a href="desc/addenda/LDC2004T23_1.gif" rel="nofollow">Preview of paragraph morphology tree.</a> <a href="desc/addenda/LDC2004T23_2.gif" rel="nofollow">New analytical rendering style.</a></p>
<h3>Support</h3>
<p>PADT 1.0 was supported by the <a href="http://www.msmt.cz/" rel="nofollow">Ministry of Education of the Czech Republic</a>, projects LN00A063 and MSM113200006, and by the <a href="http://www.gacr.cz/" rel="nofollow">Grant Agency of the Czech Republic</a>, project 405/02/0823.</p>
<h3>Updates</h3>
<p>Updates or bug fixes may be available in the LDC catalog entry for this corpus, or at the <a href="http://ckl.mff.cuni.cz/padt/" rel="nofollow">PADT website</a>.</p>
<p>Questions and suggestions are welcome at <a rel="nofollow">padt (at) ckl (dot) mff (dot) cuni (dot) cz</a>.</p>
<p>Portions © 2000 Agence France Presse, © 2001 Al Hayat, © 2002 An Nahar, © 2002 Ummah Press Service, © 2003 Xinhua News Agency, © 2000-2004 Trustees of the University of Pennsylvania</p>
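As a rough cross-check of the totals quoted in the Data section, the following minimal Python sketch sums the rounded token counts from the table. The dictionary layout and the `totals` helper are illustrative only and are not part of the PADT distribution. The sketch also assumes that the 49,000 MorphoTrees tokens with analytical processing correspond to the [A] counts of the three data sets that have an [M] count, which the description does not state explicitly.

```python
# Rounded token counts copied from the Data table.
# Each entry is (analytical [A], MorphoTrees [M]); None stands for N/A.
TABLE = {
    "AFP": (13_000, None),
    "UMH": (38_500, None),
    "XIN": (13_500, None),
    "ALH": (10_000, 73_500),
    "ANN": (12_500, 25_500),
    "XIA": (26_500, 49_500),
}


def totals(table):
    """Return (analytical total, MorphoTrees total, analytical total of sets with MorphoTrees)."""
    analytical = sum(a for a, _ in table.values())
    morpho = sum(m for _, m in table.values() if m is not None)
    # Assumption: the overlap is the [A] count of every set that also has an [M] count.
    both = sum(a for a, m in table.values() if m is not None)
    return analytical, morpho, both


analytical_total, morpho_total, both_total = totals(TABLE)
print(analytical_total, morpho_total, both_total)  # 114000 148500 49000
```

Because the table figures are rounded, the sums agree with the prose ("over 113,500", "more than 148,000", "49,000") only approximately.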
Provider:
Linguistic Data Consortium
Created:
2020-11-30