Arabic Newswire English Translation Collection
Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2009T22
Description:
<h4>Introduction</h4><br>
<p>Arabic Newswire English Translation Collection was developed by the Linguistic Data Consortium (LDC) and consists of approximately 550,000 words of Arabic newswire text and its English translation. The source text comes from Agence France Presse (France), An Nahar (Lebanon) and Assabah (Tunisia).</p><br>
<p>The source Arabic text in this release is contained in LDC's Arabic Treebank series, specifically Part 1 (<a href="http://catalog.ldc.upenn.edu/LDC2003T06" rel="nofollow">Part 1 v. 2.0</a>; <a href="http://catalog.ldc.upenn.edu/LDC2005T02">Part 1 v. 3.0</a>), Part 3 (<a href="http://catalog.ldc.upenn.edu/LDC2004T11" rel="nofollow">Part 3 v. 1.0</a>; <a href="http://catalog.ldc.upenn.edu/LDC2005T20" rel="nofollow">Part 3 v. 2.0</a>) and Part 4 (<a href="http://catalog.ldc.upenn.edu/LDC2005T30" rel="nofollow">Part 4 v. 1.0</a>). A subset of the Agence France Presse (AFP) source text from Arabic Treebank: Part 1 v. 2.0 was previously translated and released by LDC as <a href="http://catalog.ldc.upenn.edu/LDC2003T07" rel="nofollow">Arabic Treebank: Part 1 - 10K-word English Translation, LDC2003T07</a>. Note that the 49 translations for this AFP subset are not included in this release, which leaves a total of 1,682 translations for the 1,731 source stories.</p><br>
<p>The English translations in this corpus were provided by translation agencies using LDC's Arabic Translation Guidelines. Although multiple translation agencies worked on both the An Nahar and Assabah sources, each document has a single translation.</p><br>
<h4>Data</h4><br>
<p>The number of stories and their epochs for each source are as follows:</p><br>
<table><br>
<tbody><br>
<tr><br>
<td>AFP</td><br>
<td>734 stories; July 2000 - November 2000</td><br>
</tr><br>
<tr><br>
<td>An Nahar</td><br>
<td>600 stories; January 2002 - December 2002</td><br>
</tr><br>
<tr><br>
<td>Assabah</td><br>
<td>397 stories; September 2004 - November 2004</td><br>
</tr><br>
<tr><br>
<td>Total</td><br>
<td>1,731 stories</td><br>
</tr><br>
</tbody><br>
</table><br>
<p>Word count of Arabic tokens by source is shown in the following table:</p><br>
<table><br>
<tbody><br>
<tr><br>
<td>AFP</td><br>
<td>102,564</td><br>
</tr><br>
<tr><br>
<td>An Nahar</td><br>
<td>299,681</td><br>
</tr><br>
<tr><br>
<td>Assabah</td><br>
<td>149,259</td><br>
</tr><br>
<tr><br>
<td>Total</td><br>
<td>551,504</td><br>
</tr><br>
</tbody><br>
</table><br>
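<p>As a quick consistency check, the totals in the two tables and the translation count given in the introduction can be recomputed from the per-source figures. The sketch below only uses numbers stated in this description; the variable and function names are illustrative and not part of the release.</p><br>

```python
# Per-source figures copied from the two tables above.
stories = {"AFP": 734, "An Nahar": 600, "Assabah": 397}
arabic_tokens = {"AFP": 102_564, "An Nahar": 299_681, "Assabah": 149_259}

# Translations of the AFP subset that were released separately in LDC2003T07.
excluded_afp_translations = 49

total_stories = sum(stories.values())
total_tokens = sum(arabic_tokens.values())
translations_in_release = total_stories - excluded_afp_translations


def mean_tokens_per_story(source: str) -> float:
    """Average number of Arabic tokens per story for one source (derived, not from the release)."""
    return arabic_tokens[source] / stories[source]


if __name__ == "__main__":
    print("stories:", total_stories)
    print("Arabic tokens:", total_tokens)
    print("translations in this release:", translations_in_release)
    for source in stories:
        print(source, round(mean_tokens_per_story(source), 1))
```

<p>The per-story averages are derived values, shown only to illustrate how much longer the An Nahar and Assabah stories are than the AFP stories.</p><br>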
<p>The original source files used different encodings for the Arabic characters, including UTF-8 and ASMO. SGML tags were used to mark sentence and paragraph boundaries and to annotate other information about each story. All Arabic source data was converted to UTF-8, and most SGML tags were removed or replaced by "plain text" markers.</p><br>
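<p>The kind of normalization described above can be sketched as follows. This is not LDC's actual conversion script, and it rests on several assumptions: that "ASMO" refers to ASMO 708 (whose layout matches ISO-8859-6), that the target encoding is UTF-8, and that sentence and paragraph boundaries were tagged with elements named <code>S</code> and <code>P</code>. The tag names and the newline markers are hypothetical; the release's real markers are not specified in this description.</p><br>

```python
import re

# Assumption: "ASMO" means ASMO 708, which Python exposes as the ISO-8859-6 codec.
ASSUMED_ASMO_CODEC = "iso8859_6"

# Hypothetical mapping from closing boundary tags to plain-text markers.
BOUNDARY_MARKERS = {"s": "\n", "p": "\n\n"}

# Matches any opening or closing SGML-style tag and captures its name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def _replace_tag(match: re.Match) -> str:
    """Turn closing boundary tags into markers and drop every other tag."""
    if match.group(0).startswith("</"):
        return BOUNDARY_MARKERS.get(match.group(1).lower(), "")
    return ""


def to_utf8_plain_text(raw: bytes, source_encoding: str) -> bytes:
    """Decode a story, replace or strip its tags, and re-encode it as UTF-8."""
    text = raw.decode(source_encoding)
    text = TAG_RE.sub(_replace_tag, text)
    return text.strip().encode("utf-8")


if __name__ == "__main__":
    sample = "<P><S>سلام</S></P>".encode(ASSUMED_ASMO_CODEC)
    print(to_utf8_plain_text(sample, ASSUMED_ASMO_CODEC).decode("utf-8"))
```

<p>Passing the source encoding explicitly keeps the sketch usable for files that were already in UTF-8, where only the tag handling applies.</p><br>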
<h4>Samples</h4><br>
<ul><br>
<li><a href="desc/addenda/LDC2009T22_src.jpg" rel="nofollow">Arabic Source</a></li><br>
<li><a href="desc/addenda/LDC2009T22_trans.jpg" rel="nofollow">English Translation</a></li><br>
</ul><br>
Portions © 2000 Agence-France Presse, © 2002 An Nahar, © 2004 Assabah, © 2002-2005, 2009 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



