ICSI Meeting Speech

Mendeley Data | Updated 2024-01-31 | Indexed 2024-06-29

Download link:

https://catalog.ldc.upenn.edu/LDC2004S02

Description:

Introduction

ICSI Meeting Speech was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004S02, ISBN 1-58563-285-6.

The ICSI Meeting corpus is a collection of 75 meetings recorded at the International Computer Science Institute in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible, given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each. Word-level orthographic transcriptions are available as ICSI Meeting Transcripts.

Data

The collection includes 922 speech files, for a total of approximately 72 hours of meeting-room speech. The speech is organized as one subdirectory per meeting, containing wavefiles for each channel (and possibly a .blp file specifying any censored intervals). The audio was collected at a 48 kHz sample rate and downsampled on the fly to 16 kHz. Audio files for each meeting are provided as separate time-synchronous recordings for each channel, encoded as 16-bit linear (big-endian) wavefiles, shorten-compressed in NIST SPHERE format.

The meetings were simultaneously recorded using close-talking microphones for each speaker (generally head-mounted, but early meetings contain some lapel microphones), as well as six table-top microphones: four high-quality omnidirectional PZM microphones arrayed down the center of the conference table, and two inexpensive microphone elements mounted on a mock PDA. All meetings were recorded in the same instrumented meeting room.
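The file format described above (16 kHz, 16-bit, big-endian, shorten-compressed NIST SPHERE) can be checked per file by reading the plain-text SPHERE header. The following is a minimal sketch, not part of the corpus tooling. It assumes the common SPHERE layout: a `NIST_1A` line, a header-size line, `name -type value` fields, and a closing `end_head`. The function names and the example header are hypothetical, and the field values are synthetic rather than taken from any corpus file. Decoding the shorten-compressed samples themselves needs an external tool and is not shown.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file."""
    first, size_line, _rest = data.split(b"\n", 2)
    if first.strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(size_line.decode("ascii").strip())
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, type_code, value = parts
        if type_code == "-i":        # integer field
            fields[name] = int(value)
        elif type_code == "-r":      # real-valued field
            fields[name] = float(value)
        else:                        # -sN: string of length N
            fields[name] = value
    return fields


def duration_seconds(fields: dict) -> float:
    """Duration implied by the header's sample count and sample rate."""
    return fields["sample_count"] / fields["sample_rate"]


def make_example_header() -> bytes:
    """Build a synthetic 1024-byte header (illustrative values only)."""
    body = (
        "NIST_1A\n"
        "   1024\n"
        "channel_count -i 1\n"
        "sample_rate -i 16000\n"
        "sample_n_bytes -i 2\n"
        "sample_byte_format -s2 10\n"   # "10" denotes big-endian 16-bit samples
        "sample_count -i 57600000\n"
        "sample_coding -s26 pcm,embedded-shorten-v2.00\n"
        "end_head\n"
    )
    return body.encode("ascii").ljust(1024, b" ")


if __name__ == "__main__":
    header = parse_sphere_header(make_example_header())
    print(header["sample_rate"], header["sample_byte_format"])
    print(duration_seconds(header) / 60, "minutes")
```

For a real file, one would pass the first kilobyte or so (for example, `open(path, "rb").read(1024)`) to `parse_sphere_header` and confirm that the sample rate and byte format match the description before handing the file to a decoder.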
In addition to recording the meetings themselves, the participants were also asked to read digit strings, similar to those found in TIDIGITS, at the start or end of the meeting. This small-vocabulary read-speech component of the recordings, which uses the same meeting room, speakers, and microphones, provides a valuable supplement to the natural conversational data, allowing a factorization of the speech challenges offered by the corpus. For all but a dozen of the meetings included in the corpus, at least some of the participants read digit strings; for the great majority of meetings, all participants did. The digit readings are included as part of the wavefiles for the meeting as a whole and are fully transcribed as part of the associated transcripts.

There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from three to ten participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly native to challenging to transcribe.

Samples

An audio sample is linked from the LDC catalog page.

Sponsorship

The collection and preparation of this corpus was made possible in large part through funding from DARPA (both through the Communicator project and through a ROAR "seedling"), the Swiss IM2 project (National Centre of Competence in Research, sponsored by the Swiss National Science Foundation), and a supplementary award from IBM.

Updates

There are no updates available at this time. More information is available at http://www.ICSI.Berkeley.EDU/Speech/mr.

Portions © 2000-2003 International Computer Science Institute, © 2004 Trustees of the University of Pennsylvania

Created:

2024-01-31
