Switchboard-1 Release 2
Mendeley Data · updated 2024-01-31 · indexed 2024-06-27
Download link:
https://catalog.ldc.upenn.edu/LDC97S62
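## Introduction
The Switchboard-1 Telephone Speech Corpus (LDC97S62) consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1991 under the sponsorship of the Defense Advanced Research Projects Agency (DARPA). The first release of the corpus was published by the National Institute of Standards and Technology (NIST) and distributed by the Linguistic Data Consortium (LDC) in 1992-1993. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set, and all copies of the first pressing have been distributed.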
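Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls: it gave the caller appropriate recorded prompts, selected and dialed another person (the callee) to take part in a conversation, introduced a topic for discussion, and recorded the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.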
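The two selection constraints above can be checked mechanically against a call table. The following is a minimal sketch, assuming each call is available as a `(caller_id, callee_id, topic_id)` tuple. That layout and the function name `find_violations` are illustrative assumptions, not the format of the LDC call tables.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (caller_id, callee_id, topic_id).
Call = Tuple[int, int, int]


def find_violations(calls: Iterable[Call]) -> List[str]:
    """Report breaches of the two Switchboard selection constraints."""
    seen_pairs = set()          # unordered speaker pairs that already talked
    seen_speaker_topic = set()  # (speaker, topic) combinations already used
    violations = []
    for caller, callee, topic in calls:
        # Constraint 1: no two speakers converse together more than once.
        pair = frozenset((caller, callee))
        if pair in seen_pairs:
            violations.append(f"pair {sorted(pair)} conversed more than once")
        seen_pairs.add(pair)
        # Constraint 2: no one speaks more than once on a given topic.
        for speaker in (caller, callee):
            key = (speaker, topic)
            if key in seen_speaker_topic:
                violations.append(f"speaker {speaker} repeated topic {topic}")
            seen_speaker_topic.add(key)
    return violations
```

The pair is stored as a `frozenset` so that the caller and callee roles do not matter: speakers 1 and 2 count as the same pair regardless of who placed the call.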
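## Data
In this release, assembled and published by the LDC, all known errors affecting the original publication of speech files were corrected. In addition, the contents of the NIST Sphere headers of all speech files were modified to identify each file as part of the new release and to make the usage of the sample_count header field consistent with standard Sphere usage. In particular, the sample_count field should reflect the number of samples on each channel in the file. In the initial release, this field was improperly set to the total number of samples in both channels of the file; this has been corrected in the new release.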
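The sample_count distinction matters when computing durations from the header. The sketch below parses the plain-text part of a NIST Sphere header (a `NIST_1A` magic line, a header-size line, `name -type value` fields, then `end_head`) and derives the per-channel sample count. The function names and the `counts_both_channels` flag are assumptions introduced for illustration, and the header values in the usage note are made-up examples rather than values read from a corpus file.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the text fields of a NIST Sphere header into a dict."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST Sphere header")
    fields = {}
    # lines[1] holds the header size in bytes; fields start on the next line.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN" marks a string of N characters
            fields[name] = value
    return fields


def per_channel_samples(fields: dict, counts_both_channels: bool = False) -> int:
    """Return samples per channel.

    Set counts_both_channels=True only for headers that follow the
    first-release convention, where sample_count summed both channels.
    """
    count = fields["sample_count"]
    if counts_both_channels:
        return count // fields["channel_count"]
    return count


def duration_seconds(fields: dict, counts_both_channels: bool = False) -> float:
    """Duration of the recording implied by the header fields."""
    return per_channel_samples(fields, counts_both_channels) / fields["sample_rate"]
```

For example, a header with `sample_rate -i 8000`, `channel_count -i 2`, and `sample_count -i 2400000` describes a 300-second recording under the corrected convention. Read under the first-release convention, the same sample_count would cover both channels and imply only 150 seconds, which is why the two conventions must not be mixed.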
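Since the 1997 release, the Switchboard transcripts have been carefully revised at the Institute for Signal and Information Processing (ISIP), and additional problems have been discovered and patched. Three speech files that were part of the original release were inadvertently left off the 1997 revision. After corpus users noted some problems in the original speaker attribution table, LDC audited the problem calls and corrected the attributions. The latest version of the ISIP transcriptions, the ISIP update of the ICSI phonetic transcriptions, and corrected word alignments are all available at ISIP. The LDC makes the transcript summaries available via HTTP. Researchers have used SWB-1 data for various annotation projects, including discourse annotation/speech acts, part-of-speech tagging and parsing, up-to-date orthographic transcriptions, and phonetic transcriptions. This summary documents which files have been used for the various annotations. In addition to the index of these file characteristics, there is also a table detailing speaker attributes.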
## Samples
Please view this audio sample.
## Updates
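August 11, 2015: The three files from the March 26, 2013 update were converted into unshortened (uncompressed) Sphere format. File tables and documentation were updated to reflect the conversion of these files. The corpus is also now available as a web download. All copies of this corpus obtained after August 11, 2015 include this update.
March 26, 2013: Three previously missing speech files (sw02289.sph, sw04361.sph, sw04379.sph) were added to this release. File tables and documentation were updated to reflect the addition of these files. Please contact ldc@ldc.upenn.edu to obtain this update. All copies of this corpus obtained after March 26, 2013 already include this update.
September 29, 2011: Added a file list, available through the online docs, to reflect the corpus's release on DVD. An updated readme also reflects these changes.
November 12, 2007: Updated and corrected speaker and call tables are now available online in the corpus documentation directory at https://catalog.ldc.upenn.edu/docs/LDC97S62/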
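September 2008: The Switchboard Dialog Act Corpus is a version of Switchboard-1 Release 2 tagged with a shallow discourse tagset of approximately 60 basic dialog act tags and combinations. The tagset is an augmentation of the Discourse Annotation and Markup System of Labeling (DAMSL) tagset and is referred to as the SWBD-DAMSL labels. These annotations were created in 1997 at the University of Colorado at Boulder, with the goal of building better language models for automatic speech recognition of the Switchboard domain. To that end, the label set incorporates both traditional sociolinguistic and discourse-theoretic rhetorical relations/adjacency pairs as well as some more form-based models. This corpus contains labels for 1,155 five-minute conversations comprising 205,000 utterances and 1.4 million words. The Switchboard Dialog Act Corpus is available as a free download via the online documentation folder.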
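The three file names listed in the March 26, 2013 entry give a simple way to check whether a local copy of the corpus predates that update. The following is a minimal sketch, assuming the speech files sit somewhere under a single root directory. The function name `missing_2013_files` is hypothetical, and no particular directory layout is assumed.

```python
from pathlib import Path

# File names taken from the March 26, 2013 update entry.
ADDED_IN_2013 = ("sw02289.sph", "sw04361.sph", "sw04379.sph")


def missing_2013_files(corpus_root) -> list:
    """Return the 2013-update files that are absent under corpus_root."""
    # Compare lowercased names so the check also works on media where
    # file names were stored in upper case.
    present = {
        path.name.lower()
        for path in Path(corpus_root).rglob("*")
        if path.is_file()
    }
    return [name for name in ADDED_IN_2013 if name not in present]
```

An empty result means all three files are present. A non-empty result suggests the copy was obtained before the update, in which case the entry above names ldc@ldc.upenn.edu as the contact for obtaining it.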
**Copyright notice**: Portions © 1992, 1993, 1997 Trustees of the University of Pennsylvania
Created: 2024-01-31
## Background overview
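Switchboard-1 Release 2 is a speech dataset of approximately 260 hours of English telephone conversations among 543 speakers on a range of topics. It is used mainly for speaker recognition and speech recognition research. The dataset has gone through several rounds of revision and error correction, and it features two-channel recordings (one channel per speaker) and a diverse set of topics.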
The summary above was collected and generated by 遇见数据集.



