
CS-Dialogue

ModelScope community · Updated 2026-05-13 · Indexed 2025-11-29
Download link:
https://modelscope.cn/datasets/BAAI/CS-Dialogue
Resource description:
# CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition

[![Hugging Face Datasets](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-yellow)](https://huggingface.co/datasets/BAAI/CS-Dialogue) [![arXiv](https://img.shields.io/badge/arXiv-2502.18913-b31b1b.svg)](https://arxiv.org/abs/2502.18913) [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

## Introduction

**CS-Dialogue** is a large-scale, publicly available Mandarin-English code-switching speech dialogue dataset. It addresses three limitations of existing code-switching speech datasets: small size, a lack of natural conversation, and the absence of full-length dialogue recordings. It provides a solid foundation for research in code-switching ASR and related fields. The dataset is released under a **CC BY-NC-SA 4.0 license**, which permits non-commercial use.

## Dataset Details

This dataset contains 104.02 hours of spontaneous dialogue recordings, drawn from 100 two-person conversations recorded by 200 speakers. Key features of the dataset include:

* **Speakers:** 200 speakers with strong English proficiency (e.g., IELTS ≥ 6 or passing TEM-4).
* **Geographic Diversity:** Speakers come from 30 provincial-level regions across mainland China.
* **Content:** Each conversation covers 2 to 6 topics and includes Mandarin-only, code-switching, and English-only segments.
* **Audio Format:** WAV files with a 16 kHz sampling rate.
* **Transcriptions:** Carefully prepared, character-level manual transcriptions.
* **Annotations:** The dataset includes annotations at both the utterance level and the speaker level.
  * **Utterance-level**: `id`, `audio` (file path), `text` (transcription).
  * **Speaker-level**: `speaker_id`, `age`, `gender`, `location` (province), `device`.
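The Content bullet above notes that conversations mix Mandarin-only, code-switching, and English-only segments. As a rough illustration of how a transcript could be sorted into those three categories, the sketch below checks which scripts appear in the text. This is not the authors' labeling method: the function name `classify_segment`, the character ranges, and the sample transcripts are all assumptions made for the example.

```python
import re

# Assumption: the basic CJK Unified Ideographs block stands in for "Mandarin"
# and ASCII letters stand in for "English". Real transcripts may need more care.
_HAN = re.compile(r"[\u4e00-\u9fff]")
_LATIN = re.compile(r"[A-Za-z]")


def classify_segment(text: str) -> str:
    """Label a transcript as 'mandarin', 'english', 'code-switching', or 'other'."""
    has_han = bool(_HAN.search(text))
    has_latin = bool(_LATIN.search(text))
    if has_han and has_latin:
        return "code-switching"
    if has_han:
        return "mandarin"
    if has_latin:
        return "english"
    return "other"  # e.g. empty strings, digits, or punctuation only


if __name__ == "__main__":
    # Made-up example transcripts, not taken from the dataset.
    for sample in ["我们今天去哪里", "let us grab lunch", "这个 deadline 太紧了"]:
        print(classify_segment(sample), "|", sample)
```

A script-based check like this is cheap, but it would mislabel Mandarin utterances that contain a Latin-script abbreviation or brand name, so it is best treated as a first-pass filter.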
### Dataset Structure

The dataset files are organized as follows:

```
data
├── long_wav/*.tar.gz
├── short_wav/*.tar.gz
└── index
    ├── long_wav
    │   ├── dev.txt
    │   ├── test.txt
    │   └── train.txt
    ├── short_wav
    │   ├── dev
    │   │   ├── text
    │   │   └── wav.scp
    │   ├── test
    │   │   ├── text
    │   │   └── wav.scp
    │   └── train
    │       ├── text
    │       └── wav.scp
    └── total_infomation
        └── Information_Index.txt
```

For more details, please refer to our paper, [CS-Dialogue](https://arxiv.org/abs/2502.18913).

## 📚 Cite me

```
@article{zhou2025cs,
  title={CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition},
  author={Zhou, Jiaming and Guo, Yujie and Zhao, Shiwan and Sun, Haoqin and Wang, Hui and He, Jiabei and Kong, Aobo and Wang, Shiyao and Yang, Xi and Wang, Yequan and others},
  journal={arXiv preprint arXiv:2502.18913},
  year={2025}
}
```
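Each `index/short_wav/<split>/` directory in the structure above holds a `text` file and a `wav.scp` file. The dataset card does not describe their contents, so the sketch below assumes the common Kaldi convention: one entry per line, with an utterance ID followed by whitespace and then the audio path (`wav.scp`) or the transcription (`text`). The helper names `read_kaldi_table` and `load_split` are hypothetical, and the output keys simply mirror the utterance-level fields `id`, `audio`, and `text` listed earlier.

```python
from pathlib import Path


def read_kaldi_table(path):
    """Read '<key> <value...>' lines into a dict (assumed Kaldi-style layout)."""
    table = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.strip().split(maxsplit=1)
            if not parts:
                continue  # skip blank lines
            key = parts[0]
            value = parts[1] if len(parts) > 1 else ""
            table[key] = value
    return table


def load_split(split_dir):
    """Join wav.scp and text on utterance ID for one split directory."""
    split_dir = Path(split_dir)
    wavs = read_kaldi_table(split_dir / "wav.scp")
    texts = read_kaldi_table(split_dir / "text")
    # Keep only utterances that appear in both files.
    shared_ids = sorted(wavs.keys() & texts.keys())
    return [
        {"id": utt_id, "audio": wavs[utt_id], "text": texts[utt_id]}
        for utt_id in shared_ids
    ]
```

Assuming the archives have been extracted, a call such as `load_split("data/index/short_wav/train")` would return one record per utterance. If the actual files use a different layout, only `read_kaldi_table` should need to change.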

Provider:
maas
Created:
2025-11-07