GALE Phase 2 Chinese Broadcast Conversation Speech

Name: GALE Phase 2 Chinese Broadcast Conversation Speech
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:25:00
License: No description available

DataCite Commons: updated 2021-07-01, indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2013S04

Description:

<h3>Introduction</h3><br> <p>GALE Phase 2 Chinese Broadcast Conversation Speech was developed by the Linguistic Data Consortium (LDC) and comprises approximately 120 hours of Chinese broadcast conversation speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.</p><br> <p>Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast Conversation Transcripts (<a href="http://catalog.ldc.upenn.edu/LDC2013T08" rel="nofollow">LDC2013T08</a>).</p><br> <p>Broadcast audio for the GALE program was collected at the Philadelphia, PA, USA facilities of LDC and at three remote collection sites: HKUST (Chinese); Medianet, Tunis, Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.</p><br> <p>The LDC local broadcast collection system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. 
An overview of the system, the sources recorded and the configuration of the recording laboratory is contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.</p><br> <p>LDC designed a portable platform for remote broadcast collection. This is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.</p><br> <p>HKUST collected Chinese broadcast programming using its internal recording system and a portable broadcast collection platform designed by LDC and installed at HKUST in 2006.</p><br> <h3>Data</h3><br> <p>The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Anhui Province, Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; Hubei TV, a regional broadcaster in Hubei Province, Mainland China; and Phoenix TV, a Hong Kong-based satellite television station. A table showing the number of programs and hours recorded from each source is contained in the readme file.</p><br> <p>This release contains 202 audio files presented in Waveform Audio File format (.wav), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. 
The broadcast auditing process served three principal goals: (1) as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, (2) as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded and (3) as a guide for data selection by retaining information about the genre, data type and topic of a program.</p><br> <h3>Samples</h3><br> <p>Please listen to this <a href="desc/addenda/LDC2013S04.wav" rel="nofollow">audio sample</a>.</p><br> <h3>Updates</h3><br> <p><strong>February 1st, 2016: </strong>All .wav files were converted to FLAC.</p><br> <h3>Acknowledgement</h3><br> <p>This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p><br> Portions © 2006-2007 Anhui TV, China Central TV, Hubei TV, Phoenix TV, © 2006-2007, 2011, 2013 Trustees of the University of Pennsylvania
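
The audio format stated in the Data section (16000 Hz, single-channel, 16-bit PCM) can be checked programmatically. The following is a minimal sketch and is not part of the LDC release: it uses Python's standard wave module, and the helper names check_wav_format and make_silent_wav are hypothetical, introduced here only for illustration. Because the wave module reads only WAV, files from the FLAC distribution mentioned under Updates would need to be decoded to WAV with a separate tool first.

```python
import io
import wave

# Format stated in the release description: 16000 Hz, single-channel, 16-bit PCM.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def check_wav_format(source):
    """Return (matches, details) for a WAV file path or binary file-like object.

    Hypothetical helper: compares the WAV header against the stated format.
    """
    with wave.open(source, "rb") as wav:
        details = {
            "rate_hz": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    matches = (
        details["rate_hz"] == EXPECTED_RATE_HZ
        and details["channels"] == EXPECTED_CHANNELS
        and details["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES
    )
    return matches, details


def make_silent_wav(rate_hz, channels, seconds):
    """Build an in-memory 16-bit PCM WAV of silence (demonstration data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * channels * rate_hz * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Synthetic stand-in for a corpus file; replace with a real path to a decoded WAV.
    ok, info = check_wav_format(make_silent_wav(16000, 1, 1))
    print("matches stated format:", ok, info)
```

Running the check over every decoded file and summing duration_s would also give a rough comparison against the approximately 120 hours stated for the release.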

Provider:

Linguistic Data Consortium

Created:

2020-11-30
