GALE Phase 2 Arabic Newswire Parallel Text
Source: DataCite Commons. Updated: 2021-07-01. Indexed: 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2012T17
Description:
<h3>Introduction</h3><br>
<p>GALE Phase 2 Arabic Newswire Parallel Text was developed by the Linguistic Data Consortium (LDC). Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected in 2007 by LDC and transcribed by LDC or under its direction.</p><br>
<p>LDC has released the following GALE Phase 1 &amp; 2 Arabic Parallel Text data sets:</p><br>
<ul><br>
<li>GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 (<a href="../../../LDC2007T24">LDC2007T24</a>)</li><br>
<li>GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 (<a href="../../../LDC2008T09">LDC2008T09</a>)</li><br>
<li>GALE Phase 1 Arabic Blog Parallel Text (<a href="../../../LDC2008T02">LDC2008T02</a>)</li><br>
<li>GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (<a href="../../../LDC2009T03">LDC2009T03</a>)</li><br>
<li>GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (<a href="../../../LDC2009T09">LDC2009T09</a>)</li><br>
<li>GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 (<a href="../../../LDC2012T06">LDC2012T06</a>)</li><br>
<li>GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 (<a href="../../../LDC2012T14">LDC2012T14</a>)</li><br>
<li>GALE Phase 2 Arabic Newswire Parallel Text (<a href="../../../LDC2012T17">LDC2012T17</a>)</li><br>
<li>GALE Phase 2 Arabic Broadcast News Parallel Text (<a href="../../../LDC2012T18">LDC2012T18</a>)</li><br>
<li>GALE Phase 2 Arabic Web Parallel Text (<a href="../../../LDC2013T01">LDC2013T01</a>)</li><br>
</ul><br>
<h3>Data</h3><br>
<p>GALE Phase 2 Arabic Newswire Parallel Text includes 400 source-translation pairs, comprising 181,704 tokens of Arabic source text and its English translation. Data is drawn from six distinct Arabic newswire sources: Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.</p><br>
<p>Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were formatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.</p><br>
<p>Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.</p><br>
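<p>As a rough illustration of working with a tab-delimited, UTF-8 encoded file of this kind, the following is a minimal sketch rather than an official LDC tool. It splits each non-empty line into its tab-separated fields and makes no assumption about what the columns mean, since the field definitions live in TDF_format.text. The function names <code>read_tab_delimited</code> and <code>load_segments</code> are hypothetical, and any header or comment lines a file may contain are returned as ordinary rows.</p><br>

```python
from typing import Iterable, Iterator, List


def read_tab_delimited(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of fields per non-empty line of tab-delimited text."""
    for line in lines:
        # Drop the line terminator only; tabs and spaces inside fields are kept.
        line = line.rstrip("\r\n")
        if not line:
            continue
        yield line.split("\t")


def load_segments(path: str) -> List[List[str]]:
    """Read a whole UTF-8 tab-delimited file into a list of rows.

    Hypothetical helper: column meanings are not interpreted here and
    must be looked up in TDF_format.text.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return list(read_tab_delimited(handle))
```

<p>Calling <code>load_segments</code> on a file path (for example a hypothetical <code>example.tdf</code>) returns one list of strings per line; mapping list positions to field names is left to the reader after consulting TDF_format.text.</p><br>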
<h3>Samples</h3><br>
<p>Please consult this <a href="desc/addenda/LDC2012T17.arb.png" rel="nofollow">Arabic sample</a> and <a href="desc/addenda/LDC2012T17.eng.png" rel="nofollow">English sample</a>.</p><br>
<h3>Sponsorship</h3><br>
<p>This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p><br>
<h3>Updates</h3><br>
<p>None at this time.</p><br>
Portions © 2007 Al-Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat, Assabah, © 2007, 2012 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



