huckiyang/DiPCo
Source: Hugging Face. Updated 2024-02-06; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/huckiyang/DiPCo
Resource description:
---
annotations_creators:
- expert-generated
license: cdla-permissive-1.0
language_creators:
- expert-generated
size_categories:
- 100M<n<100G
language:
- en
task_categories:
- automatic-speech-recognition
- voice-activity-detection
multilinguality:
- monolingual
task_ids: []
pretty_name: DipCo
tags:
- speaker separation
- speech-recognition
- microphone array processing
---
# DipCo - Dinner Party Corpus, Interspeech 2020
- Please consider using the Zenodo data backup link to download the audio: https://zenodo.org/record/8122551
- Paper: https://www.isca-speech.org/archive/interspeech_2020/segbroeck20_interspeech.html
- Author(s):
  - Van Segbroeck, Maarten; Zaid, Ahmed; Kutsenko, Ksenia; Huerta, Cirenia; Nguyen, Tinh; Luo, Xuewen; Hoffmeister, Björn; Trmal, Jan; Omologo, Maurizio; Maas, Roland
- Contact person(s):
  - Maas, Roland; Hoffmeister, Björn
- Distributor(s):
  - Yang, Huck
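### Download DiPCo only from the Zenodo EU open link
```
# Download the archive (rate-limited to 5 MB/s); the URL is quoted so the shell does not interpret "?"
wget --limit-rate=5m -O DipCo.tgz "https://zenodo.org/record/8122551/files/DipCo.tgz?download=1"
# Extract the archive (assumes a gzip-compressed tarball, as the .tgz name suggests)
tar -xzvf DipCo.tgz
```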
The ‘DipCo’ data corpus is a new data set that was publicly released by Amazon to help speech scientists address the difficult problem of separating speech signals in reverberant rooms with multiple speakers.
The corpus was created with the assistance of Amazon volunteers, who simulated the dinner-party scenario in the lab. We conducted multiple sessions, each involving four participants. At the beginning of each session, participants served themselves food from a buffet table. Most of the session took place at a dining table, and at fixed points in several sessions, we piped music into the room, to reproduce a noise source that will be common in real-world environments.
Each participant was outfitted with a headset microphone, which captured a clear, speaker-specific signal. Also dispersed around the room were five devices with seven microphones each, which fed audio signals directly to an administrator’s laptop. In each session, music playback started at a given time mark. The close-talk recordings were segmented and separately transcribed.
## Sessions
Each session contains the close-talk recordings of 4 participants and the far-field recordings from the 5 devices. The following naming conventions are used:
* sessions have a ```<session_id>``` label denoted by ```S01, S02, S03, ...```
* participants have a ```<speaker_id>``` label denoted by ```P01, P02, P03, P04, ...```
* devices have a ```<device_id>``` label denoted by ```U01, U02, U03, U04, U05```
* array microphones have a ```<channel_id>``` label denoted by ```CH1, CH2, CH3, CH4, CH5, CH6, CH7```
We currently have the following sessions:
| **Session** | **Participants** | **Hours** **[hh:mm]** | **#Utts** | **Music start [hh:mm:ss]** |
| ----------- | ------------------------------ | ---------------------- | --------- | -------------------------- |
| S01 | P01, **P02**, **P03**, P04 | 00:47 | 903 | 00:38:52 |
| S02 | **P05**, **P06**, **P07**, P08 | 00:30 | 448 | 00:19:30 |
| S03 | **P09**, **P10**, **P11**, P12 | 00:46 | 1128 | 00:33:45 |
| S04 | **P13**, P14, **P15**, P16 | 00:45 | 1294 | 00:23:25 |
| S05 | **P17**, **P18**, **P19**, P20 | 00:45 | 1012 | 00:31:15 |
| S06 | **P21**, P22, **P23**, **P24** | 00:20 | 604 | 00:06:17 |
| S07 | **P21**, P22, **P23**, **P24** | 00:26 | 632 | 00:10:05 |
| S08 | **P25**, P26, P27, P28 | 00:15 | 352 | 00:01:02 |
| S09 | P29, **P30**, P31, **P32** | 00:22 | 505 | 00:12:18 |
| S10 | P29, **P30**, P31, **P32** | 00:20 | 432 | 00:07:10 |
The sessions have been split into a development and evaluation set as follows:
| **Dataset** | **Sessions** | **Hours** [**hh:mm**] | **#Utts** |
| ----------- | ----------------------- | ----------------------- | --------- |
| Dev | S02, S04, S05, S09, S10 | 02:43 | 3691 |
| Eval | S01, S03, S06, S07, S08 | 02:36 | 3619 |
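As a quick consistency check, the per-session utterance counts can be summed per split and compared with the totals in the split table. The sketch below hard-codes the numbers from the two tables above; the names `UTTS_PER_SESSION`, `SPLITS`, and `split_totals` are illustrative and are not part of any corpus tooling.

```python
# Utterance counts copied from the session table above.
UTTS_PER_SESSION = {
    "S01": 903, "S02": 448, "S03": 1128, "S04": 1294, "S05": 1012,
    "S06": 604, "S07": 632, "S08": 352, "S09": 505, "S10": 432,
}

# Dev/eval assignment copied from the split table above.
SPLITS = {
    "dev": ["S02", "S04", "S05", "S09", "S10"],
    "eval": ["S01", "S03", "S06", "S07", "S08"],
}


def split_totals(utts, splits):
    """Return the total number of utterances per split."""
    return {name: sum(utts[s] for s in sessions) for name, sessions in splits.items()}


totals = split_totals(UTTS_PER_SESSION, SPLITS)
print(totals)  # {'dev': 3691, 'eval': 3619}
```

Both sums agree with the #Utts column of the split table.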
The DiPCo data set has the following directory structure:
```bash
DiPCo/
├── audio
│   ├── dev
│   └── eval
└── transcriptions
    ├── dev
    └── eval
```
## Audio
The audio data is converted into WAV format with a sample rate of 16kHz and 16-bit precision. The close-talk recordings were made by monaural microphone and contain a single channel. The far-field recordings of all 5 devices were microphone array recordings and contain 7 raw audio channels.
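The stated format can be verified on a downloaded file with Python's standard `wave` module. This is a minimal sketch, assuming the files are plain PCM WAV and that each file holds a single channel (the per-channel file names suggest this for the far-field recordings, but it is an assumption); `check_wav_format` is a hypothetical helper name.

```python
import wave


def check_wav_format(path, expected_channels=1):
    """Return True if the file is 16 kHz, 16-bit audio with the expected channel count."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16-bit
            and wav.getnchannels() == expected_channels
        )
```

A call such as `check_wav_format("S01_P03.wav")` would then report whether that file matches the description above; the path is only an example and depends on where the archive was extracted.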
The WAV file name convention is as follows:
* close-talk recording of session ```<session_id>``` and participant ```<speaker_id>```
  * ```<session_id>_<speaker_id>.wav```, e.g. ```S01_P03.wav```
* far-field recording of microphone ```<channel_id>``` of session ```<session_id>``` and device ```<device_id>```
  * ```<session_id>_<device_id>.<channel_id>.wav```, e.g. ```S02_U03.CH1.wav```
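The naming convention can be turned into a small parser that recovers the IDs from a file name. The sketch below assumes two-digit session, speaker, and device numbers, matching the labels `S01`, `P01`, and `U01` listed under Sessions, and channels `CH1` to `CH7`; the function and pattern names are illustrative.

```python
import re

# Patterns follow the naming convention described above.
CLOSE_TALK_RE = re.compile(r"^(?P<session_id>S\d{2})_(?P<speaker_id>P\d{2})\.wav$")
FAR_FIELD_RE = re.compile(
    r"^(?P<session_id>S\d{2})_(?P<device_id>U\d{2})\.(?P<channel_id>CH[1-7])\.wav$"
)


def parse_wav_name(name):
    """Return a dict of IDs for a DiPCo WAV file name, or None if it does not match."""
    for kind, pattern in (("close-talk", CLOSE_TALK_RE), ("far-field", FAR_FIELD_RE)):
        match = pattern.match(name)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None


print(parse_wav_name("S01_P03.wav"))
print(parse_wav_name("S02_U03.CH1.wav"))
```

Returning `None` for non-matching names makes it easy to skip unrelated files when walking the `audio` directory.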
## Transcriptions
Per session, a JSON transcription file ```<session_id>.json``` is provided. For each transcribed utterance, the JSON file contains the following metadata:
* Session ID ("session_id")
* Speaker ID ("speaker_id")
* Gender ("gender")
* Mother tongue ("mother_tongue")
* Nativeness ("nativeness")
* Transcription ("words")
* Start time of the utterance ("start_time"), given for:
  * the close-talk microphone recording of the speaker (```close-talk```)
  * the far-field microphone array recordings of the devices with a ```<device_id>``` label
* End time of the utterance ("end_time"), given for the same recordings
* Reference signal that was used for transcribing the audio ("ref")
The following is an example annotation of one utterance in a JSON file:
```json
{
    "start_time": {
        "U01": "00:02:12.79",
        "U02": "00:02:12.79",
        "U03": "00:02:12.79",
        "U04": "00:02:12.79",
        "U05": "00:02:12.79",
        "close-talk": "00:02:12.79"
    },
    "end_time": {
        "U01": "00:02:14.84",
        "U02": "00:02:14.84",
        "U03": "00:02:14.84",
        "U04": "00:02:14.84",
        "U05": "00:02:14.84",
        "close-talk": "00:02:14.84"
    },
    "gender": "male",
    "mother_tongue": "U.S. English",
    "nativeness": "native",
    "ref": "close-talk",
    "session_id": "S02",
    "speaker_id": "P05",
    "words": "[noise] how do you like the food"
}
```
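The timestamps are strings in `hh:mm:ss.xx` form, so they need to be converted before computing durations or comparing them with the music start time of a session. The following is a minimal sketch using the values from the example above and the S02 music start time from the session table; the helper names are hypothetical.

```python
def to_seconds(timestamp):
    """Convert an "hh:mm:ss" or "hh:mm:ss.xx" string to seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def utterance_duration(utt, stream="close-talk"):
    """Duration in seconds of one utterance on the given recording stream."""
    return to_seconds(utt["end_time"][stream]) - to_seconds(utt["start_time"][stream])


def starts_after_music_onset(utt, music_start, stream="close-talk"):
    """True if the utterance starts at or after the music start time of its session."""
    return to_seconds(utt["start_time"][stream]) >= to_seconds(music_start)


# Values taken from the example annotation above (close-talk stream only).
utt = {
    "start_time": {"close-talk": "00:02:12.79"},
    "end_time": {"close-talk": "00:02:14.84"},
}

print(round(utterance_duration(utt), 2))          # 2.05
print(starts_after_music_onset(utt, "00:19:30"))  # False
```

The example utterance starts well before the S02 music onset at 00:19:30, so it falls in the music-free part of the session.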
Transcriptions include the following tags:
- [noise] noise made by the speaker (coughing, lip smacking, clearing throat, breathing, etc.)
- [unintelligible] speech was not well understood by transcriber
- [laugh] participant laughing
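Depending on the task, these tags may need to be removed before the text is used, for example when preparing reference transcripts for scoring. Whether to strip them is a choice left to the user; the sketch below shows one way to do it, with `TAG_RE` and `strip_tags` as illustrative names.

```python
import re

# Matches exactly the three tags listed above.
TAG_RE = re.compile(r"\[(?:noise|unintelligible|laugh)\]")


def strip_tags(words):
    """Remove annotation tags and collapse the leftover whitespace."""
    return " ".join(TAG_RE.sub(" ", words).split())


print(strip_tags("[noise] how do you like the food"))  # how do you like the food
```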
## License Summary
The DiPCo data set has been released under the CDLA-Permissive license. See the LICENSE file.
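Provider:
huckiyang
Summary of original information
Dataset overview
Dataset name
- Name: DipCo - Dinner Party Corpus
- Alias: DiPCo
Dataset attributes
- Language: English (en)
- Task categories: automatic-speech-recognition, voice-activity-detection
- Multilinguality: monolingual
- Tags: speaker separation, speech-recognition, microphone array processing
- License: CDLA-Permissive-1.0
- Size range: 100M<n<100G
- Annotation creators: expert-generated
- Language creators: expert-generated
Dataset contents
- Audio format: WAV, 16 kHz, 16-bit
- Recording types: close-talk (monaural microphone), far-field (microphone array, 7 channels)
- File naming convention:
  - Close-talk recordings: ```<session_id>_<speaker_id>.wav```
  - Far-field recordings: ```<session_id>_<device_id>.<channel_id>.wav```
- Transcription format: JSON
- Transcription contents: session ID, speaker ID, gender, mother tongue, nativeness, transcription text, start time, end time, reference signal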
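Session details
- Number of sessions: 10
- Participants: 4 per session
- Devices: 5
- Microphone channels: 7 per device
- Session naming: ```<session_id>``` (e.g. S01, S02, ...)
- Participant naming: ```<speaker_id>``` (e.g. P01, P02, ...)
- Device naming: ```<device_id>``` (e.g. U01, U02, ...)
- Microphone channel naming: ```<channel_id>``` (e.g. CH1, CH2, ...)
Development and evaluation sets
- Development set: S02, S04, S05, S09, S10; 2 hours 43 minutes in total; 3691 utterances
- Evaluation set: S01, S03, S06, S07, S08; 2 hours 36 minutes in total; 3619 utterances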
License
- Type: CDLA-Permissive
- Details: see the LICENSE file
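Collected summary
Dataset introduction
Construction
For conversations among several people in a reverberant room, the DiPCo dataset was built in a lab with the help of Amazon volunteers, who simulated a dinner-party scenario. The dataset contains multiple sessions with four participants each. Participants wore headset microphones to capture a clear, speaker-specific signal, while five devices with seven microphones each were dispersed around the room and fed audio directly to an administrator's laptop. Music playback started at a given time mark, and the close-talk recordings were segmented and separately transcribed.
Characteristics
DiPCo is a monolingual dataset of multi-speaker conversations that are cleanly recorded yet challenging. It includes both close-talk recordings and far-field microphone array recordings, and provides detailed transcription files with metadata such as gender, mother tongue, and nativeness. The data is split into development and evaluation sets, which supports model tuning and evaluation. The dataset is released under the CDLA-Permissive license, which allows flexible use.
Usage
To use DiPCo, download the compressed archive from Zenodo and extract it to a directory of your choice. The audio is stored as WAV files with a 16 kHz sample rate and 16-bit precision. Each session has a corresponding JSON transcription file containing the metadata and the start and end times of every transcribed utterance. Users can use these transcriptions to train or evaluate models as needed, and can use and share the dataset under the terms of the license.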
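Background and challenges
Background overview
DipCo, short for Dinner Party Corpus, is a speech dataset presented at Interspeech 2020. It was created by Amazon to help speech scientists address the problem of separating speech signals in reverberant rooms with multiple speakers. Amazon volunteers simulated a dinner-party scenario in the lab over multiple sessions, which captured a clear signal for each individual speaker together with the microphone array signals from five devices in the room. The release supports research on multi-speaker speech recognition and separation, and its scenario and annotations make it a useful resource for the field.
Current challenges
The challenges in building DipCo include accurately capturing and separating each speaker's speech in the simulated dinner-party environment, preserving clarity and intelligibility under reverberation and background noise, and efficiently processing and annotating a large volume of microphone array data. On the research side, the challenges include improving the accuracy and robustness of multi-speaker speech recognition and separation algorithms and applying them effectively in complex real-world settings.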
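Common scenarios
Classic use cases
In automatic speech recognition and voice activity detection, the huckiyang/DiPCo dataset stands out for its audio recordings of a simulated dinner party. It is used to study the separation of multi-speaker speech signals in reverberant rooms, which helps in understanding complex real-world audio environments.
Practical applications
Beyond improving speech recognition, the huckiyang/DiPCo dataset is relevant to applications such as sound source localization in smart-home systems, noise suppression, and multichannel audio processing.
Derived related work
Work building on the huckiyang/DiPCo dataset includes, but is not limited to, improvements to multi-speaker speech separation algorithms, the development of robust speech recognition techniques, and new methods for acoustic model training. This work has extended the range of applications of the dataset.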
The content above was collected and summarized by 遇见数据集.



