
LDC Spoken Language Sampler

Source: Mendeley Data. Updated: 2024-01-31. Indexed: 2024-06-27.
Download link:
https://catalog.ldc.upenn.edu/LDC2008S08
Resource description:
## Introduction

The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. As of 2008, LDC is a growing consortium of more than 100 companies, universities, and government members, and it has distributed over 50,000 corpora to a global audience.

With the support of its members, LDC is able to provide critical services to the language research community. These services include maintaining the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating intellectual property agreements with potential information providers and would-be members, and maintaining relations with other like-minded groups around the world.

Resources available from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials.

## Data

The LDC Spoken Language Sampler provides a variety of speech, transcript and lexicon samples and is designed to illustrate the variety and breadth of the resources available from the LDC Publication Catalog.

Most excerpts are truncated to be much shorter than the original files, typically to one minute and thirty seconds of speech. Signal amplitude has been adjusted where necessary to normalize playback volume. Some corpora are published in compressed form, but all samples here are uncompressed. LDC typically uses the NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities.

The sampler includes samples from the following corpora and lexicons. Audio samples range from 30 seconds to 90 seconds and are accompanied by transcripts.

1. An English Dictionary of the Tamil Verb: This dictionary contains translations for over 6000 English verbs and defines over 9000 Tamil verbs. Entries include the English word, the Tamil equivalent in transliteration and Tamil script, and audio examples in Spoken Tamil pronunciation.
2. CALLFRIEND Farsi: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
3. CALLFRIEND Tamil: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil.
4. CALLHOME Japanese: A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
5. CALLHOME Spanish: A corpus of 120 unscripted telephone conversations between native Spanish speakers and a corpus of associated transcripts.
6. CSLU Kids Speech: Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10.
7. Fisher Levantine Arabic: A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
8. Grassfields Bantu Fieldwork: Dschang Tone Paradigms: Tone paradigms from Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon.
9. Gulf Arabic Conversational Telephone Speech: Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
10. Korean Telephone Speech: A collection of 100 telephone conversations between native Korean speakers and their transcriptions.
11. Mawukakan Lexicon: The first publication of an ongoing project aiming to build an electronic dictionary of four Mandekan languages (Eastern Manding languages of the Mande Group of the Niger-Congo family).
12. Nationwide Speech Project: A database of speech representing current regional accents and dialects of the United States.
13. NIST Pilot Meeting Speech: Collects speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
14. West Point Russian Speech: Utterances of sentences in Russian from 1,891 native and non-native speakers.

## How to Obtain

The LDC Spoken Language Sampler may be downloaded freely. The sampler is a GNU-zipped (gzip) tar file, which most compression utilities will readily extract. Download size: 74 MB.

Portions © 2008 Trustees of the University of Pennsylvania
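The description above says the sampler ships as a gzipped tar file whose audio has been converted to WAV (RIFF), with samples of roughly 30 to 90 seconds. The following is a minimal Python sketch of how one might unpack such an archive and check sample durations. It is not an official LDC tool. The function names, the archive filename in the usage note, and the assumption that audio files end in a lowercase `.wav` extension are all hypothetical. Python's standard `wave` module reads only uncompressed PCM WAV data, so files stored in another encoding would raise `wave.Error`.

```python
import tarfile
import wave
from pathlib import Path


def extract_sampler(archive_path, dest_dir):
    """Unpack a gzipped tar archive into dest_dir and return the WAV files found.

    Only extract archives obtained from a trusted source.
    Assumption: audio files use a lowercase ".wav" extension.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(dest)
    return sorted(dest.rglob("*.wav"))


def wav_duration_seconds(wav_path):
    """Return the duration in seconds of an uncompressed PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

A possible usage, with a placeholder archive name: call `extract_sampler("sampler.tgz", "sampler_out")`, then print `wav_duration_seconds(path)` for each returned path to see whether the excerpts fall within the 30 to 90 second range stated in the description.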

Created:
2024-01-31
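Dataset introduction
Background overview
The LDC Spoken Language Sampler is a multilingual collection of speech samples. It contains speech, transcript and lexicon samples in multiple languages and is intended to illustrate the diversity of the language resources available from LDC. The samples have been truncated, had their signal amplitude adjusted, and been converted to MS-WAV format for ease of use, which makes them suitable for researchers and engineers working on language research.
The summary above was collected and generated by 遇见数据集.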