RealTalk-CN
ModelScope community (魔搭社区) | Updated 2026-05-16 | Indexed 2025-11-15
Download link:
https://modelscope.cn/datasets/BAAI/RealTalk-CN
Resource description:
# RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis
📌 **Resources:**
- [GitHub Repository](https://github.com/Summer-Enzhi/RealTalk)
- [Arxiv Paper](https://arxiv.org/abs/2508.10015)
**RealTalk-CN** is the first large-scale, multi-domain, bimodal (speech-text) Chinese **Task-Oriented Dialogue (TOD)** dataset. All data come from real human-to-human conversations and were collected to advance research on speech-based large language models (Speech LLMs). Existing TOD datasets are mostly text-based and lack real speech, spontaneous disfluencies, and cross-modal interaction scenarios. RealTalk-CN addresses all three gaps, supporting Chinese speech dialogue modeling and evaluation.
The dataset is released under the **CC BY-NC-SA 4.0 license**, and can be freely used for non-commercial research.
---
## Dataset Composition
- **Total Duration:** ~150 hours of verified real human-to-human dialogue audio
- **Dialogue Scale:** 5,400 multi-turn dialogues, over 60,000 utterances
- **Speakers:** 113 individuals, balanced gender ratio, ages 18–50, covering major dialect regions across China
- **Dialogue Domains:** 58 task-oriented domains (e.g., dining, transportation, shopping, healthcare, finance), including 55 intents and 115 slots
- **Audio Specifications:** 16 kHz sampling rate, WAV format, recorded on both professional and mobile devices
- **Transcription & Annotation:**
  - Manually transcribed at the character level, preserving spoken language features
  - Annotated with 4 categories of disfluencies (elongation, repetition, self-correction, hesitation)
  - Includes transcriptions, slot values, intents, and speaker metadata (gender, age, region, etc.)
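
The figures in the list above imply some rough averages that help when budgeting compute or sanity-checking a download. The sketch below derives them and checks a WAV header against the stated sampling rate. The constants are taken from the list and are approximate in the source ("~150 hours", "over 60,000 utterances"); the function names are illustrative and not part of any released tooling.

```python
import wave

# Figures quoted in the composition list (approximate in the source).
TOTAL_HOURS = 150              # "~150 hours"
NUM_DIALOGUES = 5_400          # "5,400 multi-turn dialogues"
NUM_UTTERANCES = 60_000        # "over 60,000 utterances", used as a lower bound
EXPECTED_SAMPLE_RATE = 16_000  # 16 kHz sampling rate


def derived_stats():
    """Return rough per-dialogue and per-utterance averages."""
    total_seconds = TOTAL_HOURS * 3600
    return {
        "utterances_per_dialogue": NUM_UTTERANCES / NUM_DIALOGUES,
        "seconds_per_dialogue": total_seconds / NUM_DIALOGUES,
        "seconds_per_utterance": total_seconds / NUM_UTTERANCES,
    }


def has_expected_sample_rate(path, expected=EXPECTED_SAMPLE_RATE):
    """Check a WAV file's header against the stated sampling rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == expected
```

Under these assumptions a dialogue averages about 11 utterances and 100 seconds of audio. Because the utterance count is a lower bound and the audio presumably includes pauses between turns, the 9-second per-utterance figure is an upper estimate rather than a measured speech duration.
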
---
## Dataset Features
1. **Natural and Colloquial:** Contains spoken features and disfluencies in real task-oriented dialogues, overcoming the limitation of “read speech” corpora.
2. **Bimodal and Real Interaction:** Provides paired speech-text annotations and introduces a *cross-modal chat task* that supports dynamic switching between speech and text, which is closer to real-world human-computer interaction.
3. **Complete Dialogues and Multi-Domain Coverage:** Average of 12 turns per dialogue, covering 58 real-world domains, supporting both single-domain and cross-domain dialogue modeling.
4. **Diverse Speakers:** Covers major regions in China, balanced across gender and age, enabling research on the impact of accents, dialects, and demographic differences.
5. **High-Quality Annotation and Strict Quality Control:** Multiple rounds of manual verification, detailed timestamps, and slot annotations ensure reliability and research value.
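
To make the paired annotations and the cross-modal switching described above more concrete, here is a minimal sketch of how one annotated utterance might be represented and how a dialogue history could mix modalities turn by turn. Every field name and the `cross_modal_context` helper are assumptions made for illustration; the released files may use a different schema, and the benchmark's actual task format is defined in the paper and repository.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Disfluency(Enum):
    # The four categories named in the dataset description.
    ELONGATION = "elongation"
    REPETITION = "repetition"
    SELF_CORRECTION = "self-correction"
    HESITATION = "hesitation"


@dataclass
class Utterance:
    # All field names are hypothetical; the released schema may differ.
    speaker_id: str
    text: str                # character-level transcript
    audio_path: str          # path to the corresponding WAV segment
    start_sec: float         # timestamp annotations
    end_sec: float
    intent: Optional[str] = None
    slots: dict = field(default_factory=dict)
    disfluencies: list = field(default_factory=list)


def cross_modal_context(utterances, modalities):
    """Render each turn as either its transcript or its audio path.

    `modalities` holds "text" or "speech" for each turn, emulating a
    dialogue history that switches modality between turns.
    """
    if len(utterances) != len(modalities):
        raise ValueError("need exactly one modality per utterance")
    context = []
    for utt, modality in zip(utterances, modalities):
        if modality == "text":
            context.append(("text", utt.text))
        elif modality == "speech":
            context.append(("speech", utt.audio_path))
        else:
            raise ValueError(f"unknown modality: {modality!r}")
    return context
```

Keeping the transcript and the audio reference on the same record is what lets a single dialogue be replayed as all-text, all-speech, or any mixture without duplicating annotations.
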
---
## Advantages
- The **first large-scale Chinese speech-text TOD corpus**, filling the gap in benchmark datasets for Chinese spoken dialogue.
- Provides **disfluency annotations**, supporting robustness evaluation and error correction research in speech-based TOD systems.
- Enables research in **speech recognition, speech synthesis, intent recognition, slot filling, dialogue management, and cross-modal studies**.
- Serves as a **benchmark for Speech LLMs in Chinese TOD tasks**, driving the development of advanced speech interaction systems.
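
Because the transcripts are produced at the character level, one natural way to score speech recognition output against them is character error rate (CER), a common metric for Chinese ASR. The source does not state which metrics the benchmark uses, so the following is only a sketch of the standard definition (edit distance divided by reference length), not the authors' evaluation code.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            # Cost of aligning ref_item with hyp_item (0 if they match).
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = character edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

With disfluency-preserving references, a recognizer that silently drops a filler character is charged a deletion, so whether to score against raw or disfluency-stripped transcripts is a design decision that should be reported alongside any CER figure.
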
---
## Conclusion
The release of **RealTalk-CN** lays the foundation for research in Chinese speech-text bimodal dialogue. With its **large scale, multi-domain coverage, natural spoken language, diverse speakers, and cross-modal interaction**, it not only advances the development of Speech LLMs in task-oriented dialogue but also provides a key resource for future cross-modal and multimodal intelligent systems.
Provided by:
maas
Created:
2025-11-07
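Dataset introduction

Background overview
RealTalk-CN is a large-scale Chinese speech-text bimodal dialogue dataset containing about 150 hours of real human dialogue audio with corresponding text, covering 58 task-oriented dialogue domains. It features natural spoken language, diverse speakers, and high-quality annotation, supports research directions such as speech recognition and speech synthesis, and fills a gap in benchmark datasets for Chinese spoken dialogue.
The content above was collected and summarized by 遇见数据集.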



