LDC Spoken Language Sampler - Third Release

Mendeley Data. Updated 2024-01-31; indexed 2024-06-28.

Download link:

https://catalog.ldc.upenn.edu/LDC2015S09

Resource description:

Introduction

LDC (Linguistic Data Consortium) Spoken Language Sampler - Third Release contains samples from 20 different corpora published by LDC between 1996 and 2015.

LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.

Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog. The sampler is available as a free download.

Data

The LDC Spoken Language Sampler - Third Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:

- Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.
- Signal amplitude has been adjusted where necessary to normalize playback volume.
- Some corpora are published in compressed form, but all samples here are uncompressed.
- Some text files are presented as images to ensure foreign character sets display properly.
- In some publications, NIST SPHERE file format is used for audio data, but the audio files in this sampler are MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities. FLAC files have been expanded into their wav form as well.

The link for the catalog number takes you to the catalog entry.

LDC2014S06 2009 NIST Language Recognition Evaluation Test Set
The 2009 evaluation contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese.

LDC2014S01 CALLFRIEND Farsi Second Edition Speech
CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The CALLFRIEND project supported the development of language identification technology. Each CALLFRIEND corpus consists of unscripted telephone conversations lasting between 5 and 30 minutes.

LDC96S37 CALLHOME Japanese
A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.

LDC2013S09 CSC Deceptive Speech
CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on extracted features from the corpus.

LDC2007S18 CSLU Kids' Speech
Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10.

LDC2010S01 Fisher Spanish Speech
Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.

LDC2014S02 King Saud University Arabic Speech Database
King Saud University Arabic Speech Database contains 590 hours of recorded Arabic speech from 269 male and female Saudi and non-Saudi speakers. The utterances include read and spontaneous speech recorded in quiet and noisy environments. The recordings were collected via different microphones and a mobile phone and averaged between 16 and 19 minutes.

LDC2003S07 Korean Telephone Conversations Complete
The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. Korean Telephone Conversations Transcripts (LDC2003T08) consists of 100 text files, totaling approximately 190K words and 25K unique words. All files are in Korean orthography: orthographic Korean characters are in Hangul, encoded in KSC5601 (Wansung) system. The complete data set also includes a lexicon (LDC2003L02).

LDC2012S04 Malto Speech and Transcripts
Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh.

LDC2015S05 Mandarin Chinese Phonetic Segmentation and Tone
Mandarin Chinese Phonetic Segmentation and Tone was developed by LDC and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. This corpus was developed to investigate the use of phone boundary models on forced alignment in Mandarin Chinese.

LDC2015S04 Mandarin-English Code-Switching in South-East Asia
Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia and includes approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.

LDC2013S03 Mixer 6 Speech
Mixer 6 Speech was developed by LDC and comprises 15,863 hours of telephone speech, interviews and transcript readings from 594 distinct native English speakers. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, the focus of which was on native American English speakers local to the Philadelphia area.

LDC2014S03 Multi-Channel WSJ Audio
Multi-Channel WSJ Audio was developed by the Centre for Speech Technology Research at The University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and one single moving speaker.

LDC2004S09 NIST Meeting Pilot Corpus Speech
This data set contains speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.

LDC2015S02 RATS Speech Activity Detection
RATS Speech Activity Detection was developed by LDC and comprises approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

LDC2015S03 The Subglottal Resonances Database
The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age.

LDC2012S02 TORGO Database of Dysarthric Articulation
TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.

LDC2012S06 Turkish Broadcast News Speech and Transcripts
Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts.

LDC2014S08 United Nations Proceedings Speech
United Nations Proceedings Speech was developed by the United Nations (UN) and contains approximately 8,500 hours of recorded proceedings in the six official UN languages: Arabic, Chinese, English, French, Russian and Spanish.
The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly and First Committee (Disarmament and International Security), and meetings 6434-6763 of the Security Council.

LDC2014S04 USC-SFI MALACH Interviews and Transcripts Czech
USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.

Portions © 2015 Trustees of the University of Pennsylvania
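
The description states that the sampler's audio files are uncompressed MS-WAV (RIFF) excerpts, typically 1.5 to 2 minutes long. As a minimal sketch of how a downloaded sample might be inspected, the following uses Python's standard `wave` module. The function names `describe_wav` and `is_typical_excerpt`, and the file name in the usage comment, are hypothetical and not part of the sampler. The sketch also assumes the file holds linear PCM audio; the `wave` module raises `wave.Error` for other encodings.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a linear-PCM WAV (RIFF) file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration in seconds = number of frames / frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }


def is_typical_excerpt(info, low_s=90.0, high_s=120.0):
    """Check whether the duration falls in the 1.5 to 2 minute range
    that the description gives as typical for sampler excerpts."""
    return low_s <= info["duration_s"] <= high_s


# Hypothetical usage; "sample.wav" is a placeholder file name:
# info = describe_wav("sample.wav")
# print(info, is_typical_excerpt(info))
```

The duration check is only a convenience: the description says "most" excerpts fall in that range, so a file outside it is not necessarily damaged.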

Created:

2024-01-31
