BOLT CTS CallFriend CallHome Egyptian Arabic Transcripts and Translations
Source: DataCite Commons | Updated: 2025-10-01 | Indexed: 2026-05-03
Download link:
https://catalog.ldc.upenn.edu/LDC2025T14
Resource description:
<h3>Introduction</h3>
<p>BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations, Linguistic Data Consortium (LDC) Catalog Number LDC2025T14, was developed by LDC and consists of transcripts and their corresponding English translations for 116 hours of conversational telephone speech between native speakers of the Arabic dialect spoken in Egypt.</p>
<p>The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The telephone data was transcribed, translated and annotated for various tasks including word alignment, treebanking, and co-reference.</p>
<h3>Data</h3>
<p>The source audio recordings consist of 274 telephone conversations taken from LDC's multilingual CALLFRIEND and CALLHOME series developed to support speech identification and language identification technology development.</p>
<p>Transcribers were required to produce a verbatim transcript of all speech within a file using the <a href="https://aclanthology.org/L12-1328/">CODA</a> orthographic approach; diacritics are not included. Some transcripts include redactions for potential personally identifying information. Further information about the transcription methodology is contained in the transcription guidelines accompanying this release. All speech data was transcribed.</p>
<p>The goal of the BOLT translation task was to translate the Arabic transcripts into fluent English while preserving the meaning present in the original Arabic text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. Further information about the translation methodology is contained in the translation guidelines accompanying this release. 99% of the transcripts were translated into English.</p>
<p>The data volume in this corpus is as follows:</p>
<table border="1" summary="data volume">
<tbody>
<tr>
<td>partition</td>
<td>doc count</td>
<td>su count</td>
<td>source tokens</td>
<td>English words</td>
<td>hours</td>
</tr>
<tr>
<td>dev</td>
<td>29</td>
<td>9,663</td>
<td>63,401</td>
<td>83,206</td>
<td>6.27</td>
</tr>
<tr>
<td>eval</td>
<td>103</td>
<td>39,478</td>
<td>237,623</td>
<td>311,564</td>
<td>23.94</td>
</tr>
<tr>
<td>train</td>
<td>203</td>
<td>134,365</td>
<td>760,536</td>
<td>965,468</td>
<td>78.27</td>
</tr>
<tr>
<td>total</td>
<td>335</td>
<td>183,506</td>
<td>1,061,560</td>
<td>1,360,238</td>
<td>108.48</td>
</tr>
</tbody>
</table>
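<p>As a quick consistency check, the per-partition figures in the table can be summed and compared with the reported totals. The sketch below copies the numbers from the table; the variable and function names are illustrative and are not part of the release.</p>

```python
from decimal import Decimal

# Rows copied from the data volume table:
# (doc count, su count, source tokens, English words, hours)
ROWS = {
    "dev": (29, 9_663, 63_401, 83_206, Decimal("6.27")),
    "eval": (103, 39_478, 237_623, 311_564, Decimal("23.94")),
    "train": (203, 134_365, 760_536, 965_468, Decimal("78.27")),
}

# The "total" row as reported in the table.
REPORTED_TOTAL = (335, 183_506, 1_061_560, 1_360_238, Decimal("108.48"))


def column_sums(rows):
    """Sum each column across the partitions."""
    return tuple(sum(col) for col in zip(*rows.values()))


def english_words_per_source_token(row):
    """Ratio of English word count to source token count for one row."""
    return row[3] / row[2]


totals = column_sums(ROWS)
print("totals match:", totals == REPORTED_TOTAL)
print("English words per source token:",
      round(english_words_per_source_token(REPORTED_TOTAL), 2))
```

<p>Hours are held as <code>Decimal</code> values so that the sum is exact rather than subject to floating-point rounding. Over the whole corpus the translations run to roughly 1.28 English words per source token.</p>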
<p>Transcripts and translations are presented in XML format, UTF-8 encoded.</p>
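<p>The XML schema is not described here, so the following is only a schema-agnostic sketch of how one might take a first look at a file using Python's standard library. The sample document and its element names (<code>doc</code>, <code>seg</code>, <code>source</code>, <code>translation</code>) are invented for illustration and should not be read as the corpus's actual markup; consult the documentation shipped with the release for the real structure.</p>

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_bytes: bytes) -> Counter:
    """Count how often each element name occurs in one XML document."""
    root = ET.fromstring(xml_bytes)
    return Counter(elem.tag for elem in root.iter())


def text_snippets(xml_bytes: bytes, limit: int = 5):
    """Return up to `limit` (tag, text) pairs for elements with non-empty text."""
    root = ET.fromstring(xml_bytes)
    found = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        if text:
            found.append((elem.tag, text))
            if len(found) == limit:
                break
    return found


# HYPOTHETICAL sample: these element names are assumptions, not the real schema.
SAMPLE = (
    '<doc><seg id="1">'
    "<source>مرحبا</source>"
    "<translation>Hello</translation>"
    "</seg></doc>"
).encode("utf-8")

print(tag_counts(SAMPLE))
print(text_snippets(SAMPLE))
```

<p>For a file on disk, reading it in binary mode (<code>open(path, "rb").read()</code>) and passing the bytes to these functions lets the parser handle the UTF-8 decoding itself.</p>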
<h3>Acknowledgement</h3>
<p>This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p>
<h3>Updates</h3>
<p>No updates at this time.</p>
Provider:
Linguistic Data Consortium
Created:
2025-10-01



