CS-Dialogue
Source: ModelScope community. Updated 2026-05-13; indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/BAAI/CS-Dialogue
# CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
[Hugging Face](https://huggingface.co/datasets/BAAI/CS-Dialogue)
[arXiv:2502.18913](https://arxiv.org/abs/2502.18913)
[License: CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
## Introduction
**CS-Dialogue** is a large-scale, publicly available dataset of Mandarin-English code-switching speech dialogues. It addresses three limitations of existing code-switching speech datasets: small size, a lack of spontaneous conversation, and the absence of full-length dialogue recordings. It is intended as a foundation for research on code-switching automatic speech recognition (ASR) and related fields. The dataset is released under the **CC BY-NC-SA 4.0 license**, which permits non-commercial use only.
## Dataset Details
This dataset contains 104.02 hours of spontaneous dialogue recordings: 100 two-person conversations recorded by 200 speakers. Key features of the dataset include:
* **Speakers:** 200 speakers with strong English proficiency (e.g., IELTS ≥ 6 or passing TEM-4).
* **Geographic Diversity:** Speakers come from 30 provincial-level regions across mainland China.
* **Content:** Each conversation covers 2 to 6 topics and includes Mandarin-only, code-switching, and English-only segments.
* **Audio Format:** WAV files with a 16kHz sampling rate.
* **Transcriptions:** Carefully crafted, character-level manual transcriptions.
* **Annotations:** Annotations are provided at both the utterance level and the speaker level.
    * **Utterance-level:** `id`, `audio` (file path), `text` (transcription).
    * **Speaker-level:** `speaker_id`, `age`, `gender`, `location` (province), `device`.
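Because each conversation mixes Mandarin-only, code-switching, and English-only segments, it can be useful to bucket utterances by language before training or evaluation. The following is a minimal sketch, not part of the dataset tooling: the function name `classify_utterance` and the label strings are hypothetical, and it assumes Mandarin text falls in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and English text uses ASCII letters.

```python
import re

# Assumption: Mandarin characters lie in the basic CJK Unified Ideographs block.
_HAN = re.compile(r"[\u4e00-\u9fff]")
# Assumption: English words are written with plain ASCII letters.
_LATIN = re.compile(r"[A-Za-z]")


def classify_utterance(text: str) -> str:
    """Label a transcription by the scripts it contains (hypothetical helper)."""
    has_han = bool(_HAN.search(text))
    has_latin = bool(_LATIN.search(text))
    if has_han and has_latin:
        return "code-switching"
    if has_han:
        return "mandarin"
    if has_latin:
        return "english"
    return "other"  # e.g. digits or punctuation only
```

A script-based rule like this is only a heuristic. It cannot tell Mandarin from other languages written in Han characters, and it treats any Latin-script token as English.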
### Dataset Structure
The dataset file structure is as follows.
```
data
├── long_wav/*.tar.gz
├── short_wav/*.tar.gz
└── index
    ├── long_wav
    │   ├── dev.txt
    │   ├── test.txt
    │   └── train.txt
    ├── short_wav
    │   ├── dev
    │   │   ├── text
    │   │   └── wav.scp
    │   ├── test
    │   │   ├── text
    │   │   └── wav.scp
    │   └── train
    │       ├── text
    │       └── wav.scp
    └── total_infomation
        └── Information_Index.txt
```
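The `text` and `wav.scp` file names under `index/short_wav` suggest Kaldi-style index files, where each line holds an utterance id followed by whitespace and a value (the transcription or the audio path). That layout is an assumption here and is not confirmed by this card, so check a few lines of the real files first. Under that assumption, a split could be loaded roughly as follows; `read_kaldi_table` and `load_split` are hypothetical helper names.

```python
from pathlib import Path


def read_kaldi_table(path):
    """Read '<key> <value>' lines into a dict (assumed Kaldi-style layout)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            parts = line.split(maxsplit=1)
            # A line with only a key maps to an empty value.
            table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def load_split(split_dir):
    """Join `text` and `wav.scp` on utterance id for one split directory."""
    split_dir = Path(split_dir)
    texts = read_kaldi_table(split_dir / "text")
    wavs = read_kaldi_table(split_dir / "wav.scp")
    # Keep only ids present in both files, in sorted order.
    shared_ids = sorted(texts.keys() & wavs.keys())
    return [{"id": k, "audio": wavs[k], "text": texts[k]} for k in shared_ids]
```

For example, `load_split("data/index/short_wav/train")` would return a list of records with the utterance-level fields `id`, `audio`, and `text`.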
For more details, please refer to our paper [CS-Dialogue](https://arxiv.org/abs/2502.18913).
## 📚 Cite me
```
@article{zhou2025cs,
title={CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition},
author={Zhou, Jiaming and Guo, Yujie and Zhao, Shiwan and Sun, Haoqin and Wang, Hui and He, Jiabei and Kong, Aobo and Wang, Shiyao and Yang, Xi and Wang, Yequan and others},
journal={arXiv preprint arXiv:2502.18913},
year={2025}
}
```
Provider: maas
Created: 2025-11-07



