Santa Barbara Corpus of Spoken American English

Name: Santa Barbara Corpus of Spoken American English
Creator: www.kaggle.com
Published: 2017-09-14 00:00:00
License: No description available

Source: www.kaggle.com. Updated 2017-09-14; indexed 2025-03-25.

Download link:

https://www.kaggle.com/rtatman/santa-barbara-corpus-of-spoken-american-english


Description:

### Context

The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.

The corpus was collected by the University of California, Santa Barbara Center for the Study of Discourse. Director: John W. Du Bois (UCSB). Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB).

Each speech file is accompanied by a transcript in which phrases are time-stamped with respect to the audio recording. Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions, but the amplitude levels in these regions have been reduced relative to the original signal.

The audio data consists of MP3 format speech files, recorded in two-channel PCM at 22,050 Hz.

### Contents

This dataset contains part one of the corpus. The other three parts and additional information can be found [here](http://www.linguistics.ucsb.edu/research/santa-barbara-corpus#Contents). The following information is included in this dataset:

* Recordings: 14 recordings as .mp3 files
* Transcripts: Time-aligned transcripts for all 14 recordings, in the [CHAT format](http://childes.talkbank.org/)
* Metadata: A .csv with demographic information on speakers, as well as which recordings they appear in. (Some talkers appear in more than one recording.)

### Acknowledgements

The Santa Barbara Corpus was compiled by researchers in the Linguistics Department of the University of California, Santa Barbara. The Director of the Santa Barbara Corpus is John W. Du Bois, working with Associate Editors Wallace L. Chafe and Sandra A. Thompson (all of UC Santa Barbara), and Charles Meyer (UMass, Boston). For the publication of Parts 3 and 4, the authors are John W. Du Bois and Robert Englebretson. It is distributed here under a [CC BY-ND 3.0 US license](https://creativecommons.org/licenses/by-nd/3.0/us/).

### Inspiration

* Currently, the transcriptions are close transcriptions and include disfluencies and overlaps. Can you use NLP to convert them to broad transcriptions without this information?
* Can you create a phone-aligned transcription of this dataset? You might find it helpful to use [forced alignment](https://www.eleanorchodroff.com/tutorial/kaldi/kaldi-forcedalignment.html).
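
As a starting point for the first question above, here is a minimal Python sketch that reads one time-stamped CHAT utterance line and strips some close-transcription markers. It rests on several assumptions that should be checked against the actual transcript files: that utterance lines start with `*SPEAKER:`, that the time stamp is a `start_end` pair in milliseconds wrapped in `\x15` control characters, and that the markers to remove look like `[/]`, `[<]`, `(.)`, and `&-uh`. The marker list is illustrative and incomplete, and the names `Utterance`, `parse_main_tier`, and `to_broad` are hypothetical helpers, not part of any library.

```python
import re
from typing import NamedTuple, Optional

# Assumed time-stamp form: \x15<start_ms>_<end_ms>\x15 at the end of an utterance.
BULLET = re.compile(r"\x15(\d+)_(\d+)\x15")

# Illustrative (incomplete) set of close-transcription markers:
# bracketed codes such as [/] or [<], pauses such as (.), fillers such as &-uh,
# and overlap/scope delimiters.
MARKERS = re.compile(r"\[[^\]]*\]|\(\.+\)|&\S+|[⌈⌉⌊⌋<>]")


class Utterance(NamedTuple):
    speaker: str
    text: str
    start_ms: Optional[int]
    end_ms: Optional[int]


def parse_main_tier(line: str) -> Optional[Utterance]:
    """Parse a '*SPEAKER:<tab>text' line; return None for header or dependent lines."""
    if not line.startswith("*") or ":" not in line:
        return None
    speaker, rest = line[1:].split(":", 1)
    start = end = None
    match = BULLET.search(rest)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        rest = BULLET.sub("", rest)
    # Collapse tabs and repeated spaces left over after removing the time stamp.
    return Utterance(speaker.strip(), " ".join(rest.split()), start, end)


def to_broad(text: str) -> str:
    """Remove the markers matched by MARKERS and normalize whitespace."""
    return " ".join(MARKERS.sub(" ", text).split())


if __name__ == "__main__":
    # Made-up example line, not taken from the corpus.
    sample = "*JAMIE:\t<so you> [/] so you know (.) &-uh it works . \x1512340_13990\x15"
    utt = parse_main_tier(sample)
    print(utt.speaker, utt.start_ms, utt.end_ms)
    print(to_broad(utt.text))
```

This sketch only deletes the markers themselves. Repeated or retraced words stay in the output, so a real broad transcription would need further rules for deciding which words to drop.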


Provider: www.kaggle.com
