
GALE Phase 3 Arabic Broadcast Conversation Speech Part 1

Source: DataCite Commons; updated 2021-07-01; indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2015S11
Description:
<h3>Introduction</h3><br> <p>GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 was developed by the Linguistic Data Consortium (LDC) and comprises approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, Medianet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.</p><br> <p>Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 (<a href="../../../LDC2015T16">LDC2015T16</a>).</p><br> <p>Broadcast audio for the GALE program was collected at LDC&rsquo;s Philadelphia, PA, USA facilities and at three remote collection sites: Hong Kong University of Science and Technology (Hong Kong; Chinese), Medianet (Tunis, Tunisia; Arabic) and MTC (Rabat, Morocco; Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.</p><br> <p>LDC&rsquo;s local broadcast collection system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high-bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. 
An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.</p><br> <p>LDC designed a portable platform for remote broadcast collection. This is a TiVO-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.</p><br> <p>Medianet collected Arabic programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform installed in 2008. The portable platform deployed at the Medianet Tunisian collection facility collected multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.</p><br> <h3>Data</h3><br> <p>The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Lebanese Broadcasting Corporation, a Lebanese television station; Oman TV, a national broadcaster located in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.</p><br> <p>This release contains 149 audio files presented in <a href="http://flac.sourceforge.net">FLAC</a>-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit 
PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program&rsquo;s genre, data type and topic.</p><br> <h3>Samples</h3><br> <p>Please listen to this <a href="desc/addenda/LDC2015S11.wav">audio sample</a>.</p><br> <h3>Updates</h3><br> <p>None at this time.</p><br> <h3>Acknowledgment</h3><br> <p>This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p><br> <p>Portions © 2007 Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Aljazeera, Al Ordiniyah, Dubai TV, Oman TV, PAC Ltd, Saudi TV, Syria TV, © 2007, 2011, 2015 Trustees of the University of Pennsylvania</p>
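
The description above states that the audio is FLAC-compressed, 16000 Hz, single-channel, 16-bit PCM. A minimal sketch of how a downloaded file could be checked against that stated format follows. It uses only the Python standard library and reads the STREAMINFO metadata block that the FLAC format places at the start of a stream. The function names and the `EXPECTED` dictionary are illustrative assumptions, not part of the LDC release, and the sketch assumes the file starts directly with the `fLaC` marker (no leading ID3 tag).

```python
# Sketch only: function names and EXPECTED are hypothetical, not from the release.
EXPECTED = {"sample_rate": 16000, "channels": 1, "bits_per_sample": 16}


def parse_streaminfo(data: bytes) -> dict:
    """Parse the FLAC STREAMINFO block from the first 42 bytes of a stream."""
    if len(data) < 42 or data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    # Metadata block header: 1 bit last-block flag, 7 bits type, 24 bits length.
    block_type = data[4] & 0x7F
    block_len = int.from_bytes(data[5:8], "big")
    if block_type != 0 or block_len != 34:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 8..17 hold min/max block size (2 + 2) and min/max frame size (3 + 3).
    # Bytes 18..25 pack: 20 bits sample rate, 3 bits (channels - 1),
    # 5 bits (bits per sample - 1), 36 bits total samples.
    packed = int.from_bytes(data[18:26], "big")
    sample_rate = packed >> 44
    channels = ((packed >> 41) & 0x7) + 1
    bits_per_sample = ((packed >> 36) & 0x1F) + 1
    total_samples = packed & ((1 << 36) - 1)
    duration = total_samples / sample_rate if sample_rate else 0.0
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "bits_per_sample": bits_per_sample,
        "total_samples": total_samples,
        "duration_seconds": duration,
    }


def read_streaminfo(path: str) -> dict:
    """Read the STREAMINFO block from a FLAC file on disk."""
    with open(path, "rb") as handle:
        return parse_streaminfo(handle.read(42))


def matches_release_format(info: dict) -> bool:
    """Return True if the stream matches the format stated in the description."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

Calling `matches_release_format(read_streaminfo(path))` on each `.flac` file, where `path` is wherever the file was saved locally, would flag any file that deviates from the stated format. Summing `duration_seconds` across all files gives a rough check against the approximately 123 hours quoted above.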

Provider:
Linguistic Data Consortium
Created:
2020-11-30