
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2

Source: DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2015T09
Description:
<h3>Introduction</h3><br> <p>GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 112 hours of Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.</p><br> <p>Corresponding audio data is released as GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 (<a href="../../../LDC2015S06">LDC2015S06</a>). Part 1 of this release is GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 (<a href="../../../LDC2014T28">LDC2014T28</a>). The corresponding Part 1 audio is released as GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 (<a href="../../../LDC2014S09">LDC2014S09</a>).</p><br> <p>The broadcast conversation recordings feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Beijing TV, a national television station in Mainland China; China Central TV, a national and international broadcaster in Mainland China; Hubei TV, a regional television station in Hubei Province, Mainland China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America, a U.S. government-funded broadcast programmer.</p><br> <h3>Data</h3><br> <p>The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,388,236 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.
XTrans is available at <a href="https://www.ldc.upenn.edu/language-resources/tools/xtrans">https://www.ldc.upenn.edu/language-resources/tools/xtrans</a>.</p><br> <p>The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR in the filename were produced using QTR transcription; files with QRTR in the filename were produced using QRTR transcription.</p><br> <h3>Samples</h3><br> <p>Please view this <a href="desc/addenda/LDC2015T09.txt">text sample</a>.</p><br> <h3>Updates</h3><br> <p>None at this time.</p><br> <h3>Acknowledgement</h3><br> <p>This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p><br> Portions © 2007 Beijing TV, © 2007 China Central TV, © 2007 Hubei TV, © 2007, 2008 Phoenix TV, © 2007, 2008, 2011, 2015 Trustees of the University of Pennsylvania
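The Data section above describes UTF-8, tab-delimited transcript files whose names carry either a QTR or a QRTR marker. The following is a minimal sketch of how such files might be sorted and read in Python. The function names and the example file names are hypothetical, the case-insensitive matching is an assumption, and no column layout is assumed because the description does not list the TDF fields; consult the documentation shipped with the release for the actual schema.

```python
from pathlib import Path
from typing import List, Optional


def transcription_style(filename: str) -> Optional[str]:
    """Return "QRTR" or "QTR" based on the marker in the file name, else None."""
    # Assumption: the marker may appear in any letter case in the file name.
    name = Path(filename).name.upper()
    # Check QRTR first so the two markers are never confused.
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return None


def parse_tdf(text: str) -> List[List[str]]:
    """Split tab-delimited text into rows of fields, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(line.split("\t"))
    return rows


def read_tdf(path: str) -> List[List[str]]:
    """Read a UTF-8 encoded, tab-delimited transcript file from disk."""
    return parse_tdf(Path(path).read_text(encoding="utf-8"))
```

For example, `transcription_style("example.qrtr.tdf")` (a made-up file name) would report QRTR, and `read_tdf` could then be applied to that path to obtain one list of fields per non-blank line. Header or comment lines, if the files contain any, are returned as ordinary rows and would need to be filtered according to the release documentation.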

Provider:
Linguistic Data Consortium
Created:
2020-11-30