GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2013T17
Resource description:
<h3>Introduction</h3> <p>GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia), and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program.</p> <p>Corresponding audio data is released as GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 (<a href="http://catalog.ldc.upenn.edu/LDC2013S07" rel="nofollow">LDC2013S07</a>).</p> <p>The source broadcast conversation recordings feature interviews, call-in programs, and round table discussions focusing principally on current events from the following sources: Abu Dhabi TV (based in Abu Dhabi, United Arab Emirates), Al Alam News Channel (based in Iran), Al Arabiya (a news television station based in Dubai), Aljazeera (a regional broadcaster located in Doha, Qatar), Lebanese Broadcasting Corporation (a Lebanese television station), Oman TV (a national broadcaster located in the Sultanate of Oman), Saudi TV (a national television station based in Saudi Arabia), and Syria TV (the national television station in Syria).</p>
<h3>Data</h3> <p>The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 763,945 tokens. The transcripts were created with XTrans, an LDC-developed multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. XTrans is available at <a href="http://www.ldc.upenn.edu/tools/XTrans/downloads/" rel="nofollow">http://www.ldc.upenn.edu/tools/XTrans/downloads/</a>.</p> <p>The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information, such as topic boundaries and manual sentence unit annotation, to the core components of a quick transcript. Files with QTR in the filename were produced using QTR transcription; files with QRTR in the filename were produced using QRTR transcription.</p> <p>LDC has also released GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 (<a href="http://catalog.ldc.upenn.edu/LDC2013S02" rel="nofollow">LDC2013S02</a>) and GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 (<a href="http://catalog.ldc.upenn.edu/LDC2013T04" rel="nofollow">LDC2013T04</a>).</p>
<h3>Samples</h3> <p>Please view the following <a href="./desc/addenda/LDC2013T17.jpg" rel="nofollow">transcript sample</a>.</p>
<h3>Updates</h3> <p>None at this time.</p>
<h3>Acknowledgement</h3> <p>This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p> </br>
Portions © 2007 Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Aljazeera, Oman TV, PAC Ltd, Saudi TV, Syria TV, © 2007, 2013 Trustees of the University of Pennsylvania
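The description states that the transcripts are UTF-8, tab-delimited text and that the filename indicates whether a file follows QTR or QRTR. The Python sketch below shows one way to act on those two facts. It is only an illustration: the column layout of the TDF files is not described here, so rows are returned as plain lists of fields, and the function names, the sample file name, and the sample contents are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def transcription_style(filename: str) -> str:
    """Guess the transcription style from a file name: 'QRTR', 'QTR', or 'unknown'."""
    name = Path(filename).name.upper()
    # Check QRTR first so the more specific marker wins if both appear.
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return "unknown"


def read_tdf_rows(path):
    """Yield each non-empty line of a UTF-8, tab-delimited file as a list of fields."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters in transcribed speech as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:
                yield row


if __name__ == "__main__":
    # Hypothetical file name and contents, for illustration only;
    # the real corpus columns are defined in the release documentation.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "EXAMPLE_SHOW_20070101.qrtr.tdf"
        sample.write_text("field_a\tfield_b\nمرحبا\t1.5\n", encoding="utf-8")
        print(transcription_style(sample.name))
        for fields in read_tdf_rows(sample):
            print(fields)
```

Quoting is disabled in the reader because transcribed speech may contain stray quotation marks that should not be interpreted as CSV quoting. Any real processing should take the field definitions from the documentation included with the release.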
Provider:
Linguistic Data Consortium
Created:
2020-11-30



