Second DIHARD Challenge Development - Eleven Sources

Name: Second DIHARD Challenge Development - Eleven Sources
Creator: Linguistic Data Consortium
Published: 2021-11-15 18:46:47
License: No description available

DataCite Commons; updated 2021-11-15; indexed 2024-07-13

Download link:

https://catalog.ldc.upenn.edu/LDC2021S10

Resource description:

<h3>Introduction</h3>
Second DIHARD Challenge Development - Eleven Sources was developed by LDC and contains approximately 22 hours of English and Chinese speech data along with corresponding annotations used in support of the <a href="https://dihardchallenge.github.io/dihard2">Second DIHARD Challenge</a>.
The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As with the <a href="https://dihardchallenge.github.io/dihard1/">first challenge</a>, the second development and evaluation sets were drawn from a diverse sampling of sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and YouTube videos.
<h3>Data</h3>
This release, when combined with Second DIHARD Challenge Development - SEEDLingS (<a href="../../../LDC2021S11">LDC2021S11</a>), contains the development set audio data and annotation, except for CHiME-5 audio files, which must be obtained from the <a href="https://licensing.sheffield.ac.uk/product/chime5">University of Sheffield</a>.
Data sources used in this release are as follows (all sources are in English unless otherwise indicated):
<ul>
<li>Autism Diagnostic Observation Schedule (ADOS) interviews</li>
<li>CHiME-5 dinner party recordings (annotations only in this release)</li>
<li>Conversations in Restaurants</li>
<li>DCIEM/HCRC map task (<a href="../../../LDC96S38">LDC96S38</a>)</li>
<li>Audiobook recordings from <a href="https://librivox.org/">LibriVox</a></li>
<li>Meeting speech from 2004 Spring NIST Rich Transcription (RT-04S) Development (<a href="../../../LDC2007S11">LDC2007S11</a>) and Evaluation (<a href="../../../LDC2007S12">LDC2007S12</a>) releases</li>
<li>2001 U.S. Supreme Court oral arguments</li>
<li>Sociolinguistic interviews from SLX Corpus of Classic Sociolinguistic Interviews (<a href="../../../LDC2003T15">LDC2003T15</a>)</li>
<li>Mixer 6 Speech (<a href="../../../LDC2013S03">LDC2013S03</a>)</li>
<li>English and Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project</li>
<li>YouthPoint radio interviews</li>
</ul>
All audio is provided in the form of 16 kHz, 16-bit, mono-channel FLAC files. The diarization for each recording is stored as a NIST Rich Transcription Time Marked (RTTM) file. RTTM files are space-separated text files containing one turn per line. Segmentation files are stored as HTK label files, each containing one speech segment per line. Scoring regions for each recording are specified by un-partitioned evaluation map (UEM) files. All annotation file types are encoded as UTF-8. More information about file formats, data sources and domains is contained in the included documentation.
<h3>Samples</h3>
Please view these samples:
<ul>
<li><a href="desc/addenda/LDC2021S10.flac">Audio Sample (FLAC)</a></li>
<li><a href="desc/addenda/LDC2021S10.lab">Label Sample (TXT)</a></li>
<li><a href="desc/addenda/LDC2021S10.rttm">RTTM Sample (TXT)</a></li>
</ul>
<h3>Updates</h3>
None at this time.
Portions © 1995 Defence and Civil Institute of Environmental Medicine, © 2002 Interactive Systems Laboratories, Carnegie Mellon University, © 2000-2001 International Computer Science Institute, © SIL International (IPA93 Fonts), © 2011-2018 YouTube, LLC, © 1996, 2001, 2003, 2004, 2007, 2009-2010, 2013, 2018, 2019, 2021 Trustees of the University of Pennsylvania
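As a rough illustration of the RTTM description in the Data section (space-separated text, one turn per line), the sketch below reads turns and totals speech time per speaker. It assumes the conventional ten-field NIST layout (type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, lookahead); the corpus documentation remains the authority on the exact fields. The recording ID, speaker names, and times in the sample are invented for illustration and are not taken from the release.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    onset: float     # turn start, in seconds
    duration: float  # turn length, in seconds
    speaker: str


def parse_rttm(lines: Iterable[str]) -> List[Turn]:
    """Parse SPEAKER records, assuming the conventional ten-field layout."""
    turns = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and any record type other than SPEAKER.
        if not fields or fields[0] != "SPEAKER":
            continue
        turns.append(
            Turn(fields[1], float(fields[3]), float(fields[4]), fields[7])
        )
    return turns


def speech_time_per_speaker(turns: Iterable[Turn]) -> Dict[str, float]:
    """Sum turn durations for each speaker label."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn.speaker] += turn.duration
    return dict(totals)


# Hypothetical sample lines; not real corpus data.
sample = [
    "SPEAKER rec_0001 1 0.50 2.00 <NA> <NA> spk_A <NA> <NA>",
    "SPEAKER rec_0001 1 2.75 1.25 <NA> <NA> spk_B <NA> <NA>",
    "SPEAKER rec_0001 1 4.00 1.50 <NA> <NA> spk_A <NA> <NA>",
]
turns = parse_rttm(sample)
totals = speech_time_per_speaker(turns)
```

For a file on disk, the same function can be fed an open text handle, for example `parse_rttm(open(path, encoding="utf-8"))`, since the annotation files are stated to be UTF-8 encoded.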

Provider:

Linguistic Data Consortium

Created:

2021-11-09
