CS-Dialogue

Name: CS-Dialogue
Creator: maas
Published: 2026-05-13 19:50:30
License: No description provided

ModelScope community: updated 2026-05-13, indexed 2025-11-29

Download link:

https://modelscope.cn/datasets/BAAI/CS-Dialogue


Description:

# CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition

[![Hugging Face Datasets](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Datasets-yellow)](https://huggingface.co/datasets/BAAI/CS-Dialogue)
[![arXiv](https://img.shields.io/badge/arXiv-2502.18913-b31b1b.svg)](https://arxiv.org/abs/2502.18913)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

## Introduction

**CS-Dialogue** is a large-scale, publicly available Mandarin-English code-switching speech dialogue dataset. It addresses three limitations of existing code-switching speech datasets: small size, a lack of natural conversation, and the absence of full-length dialogue recordings. It provides a foundation for research in code-switching ASR and related fields. The dataset is released under a **CC BY-NC-SA 4.0 license**, which permits non-commercial use.

## Dataset Details

The dataset contains 104.02 hours of spontaneous dialogue recordings, comprising 100 two-person conversations recorded by 200 speakers. Key features include:

* **Speakers:** 200 speakers with strong English proficiency (e.g., IELTS ≥ 6 or a TEM-4 pass).
* **Geographic Diversity:** Speakers come from 30 provincial-level regions across mainland China.
* **Content:** Each conversation covers 2 to 6 topics and includes Mandarin-only, code-switching, and English-only segments.
* **Audio Format:** WAV files with a 16 kHz sampling rate.
* **Transcriptions:** Manual, character-level transcriptions.
* **Annotations:** Provided at both the utterance level and the speaker level.
    * **Utterance-level**: `id`, `audio` (file path), `text` (transcription).
    * **Speaker-level**: `speaker_id`, `age`, `gender`, `location` (province), `device`.
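The audio format stated above (WAV, 16 kHz) can be checked locally once the archives are unpacked. The following is a minimal sketch using only Python's standard `wave` module. The function names are illustrative and not part of the dataset's tooling, and the sketch assumes the files are uncompressed PCM WAV, which this description does not state explicitly.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sampling rate stated in the dataset description


def wav_summary(path):
    """Return (sample_rate_hz, channels, duration_seconds) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def is_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """True when the file's header reports the expected sampling rate."""
    return wav_summary(path)[0] == expected
```

Running `is_expected_rate` over a handful of extracted files is a cheap sanity check before feeding audio to an ASR front end that assumes a fixed sampling rate.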
### Dataset Structure

The dataset file structure is as follows.

```
data
├── long_wav/*.tar.gz
├── short_wav/*.tar.gz
└── index
    ├── long_wav
    │   ├── dev.txt
    │   ├── test.txt
    │   └── train.txt
    ├── short_wav
    │   ├── dev
    │   │   ├── text
    │   │   └── wav.scp
    │   ├── test
    │   │   ├── text
    │   │   └── wav.scp
    │   └── train
    │       ├── text
    │       └── wav.scp
    └── total_infomation
        └── Information_Index.txt
```

For more details, please refer to our paper [CS-Dialogue](https://arxiv.org/abs/2502.18913).

## 📚 Cite me

```
@article{zhou2025cs,
  title={CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition},
  author={Zhou, Jiaming and Guo, Yujie and Zhao, Shiwan and Sun, Haoqin and Wang, Hui and He, Jiabei and Kong, Aobo and Wang, Shiyao and Yang, Xi and Wang, Yequan and others},
  journal={arXiv preprint arXiv:2502.18913},
  year={2025}
}
```
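Returning to the `index/short_wav` directories in the file structure: each split holds a `text` file and a `wav.scp` file. Their exact layout is not described here. The names match the Kaldi convention, in which every line is an utterance id followed by whitespace and then either the transcript (`text`) or the audio path (`wav.scp`). Assuming that convention holds, a sketch for joining the two files into utterance-level records with the `id`, `audio`, and `text` fields might look like this. The function names are hypothetical.

```python
def parse_kaldi_table(lines):
    """Map each key (the first whitespace-delimited token) to the rest of its line."""
    table = {}
    for raw in lines:
        parts = raw.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        key = parts[0]
        table[key] = parts[1] if len(parts) > 1 else ""
    return table


def pair_utterances(wav_scp_lines, text_lines):
    """Join wav.scp and text on utterance id; ids missing from either file are skipped."""
    wavs = parse_kaldi_table(wav_scp_lines)
    texts = parse_kaldi_table(text_lines)
    return [
        {"id": utt, "audio": wavs[utt], "text": texts[utt]}
        for utt in wavs
        if utt in texts
    ]
```

Both functions accept any iterable of strings, so an open file handle (for example `open(path, encoding="utf-8")`) can be passed directly. If the real files use a different layout, only `parse_kaldi_table` should need to change.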


Provider: maas
Created: 2025-11-07
