RealTalk-CN

Name: RealTalk-CN
Creator: maas
Published: 2026-05-16 23:03:37
License: No description provided

ModelScope community · Updated 2026-05-16 · Indexed 2025-11-15

Download link:

https://modelscope.cn/datasets/BAAI/RealTalk-CN


Resource description:

# RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

📌 **Resources:**

- [GitHub Repository](https://github.com/Summer-Enzhi/RealTalk)
- [arXiv Paper](https://arxiv.org/abs/2508.10015)

**RealTalk-CN** is the first large-scale, multi-domain, bimodal (speech-text) Chinese **Task-Oriented Dialogue (TOD)** dataset. All data come from real human-to-human conversations, specifically constructed to advance research on speech-based large language models (Speech LLMs).

Existing TOD datasets are mostly text-based, lacking real speech, spontaneous disfluencies, and cross-modal interaction scenarios. RealTalk-CN achieves breakthroughs in these aspects, fully supporting Chinese speech dialogue modeling and evaluation.

The dataset is released under the **CC BY-NC-SA 4.0 license**, and can be freely used for non-commercial research.

---

## Dataset Composition

- **Total Duration:** ~150 hours of verified real human-to-human dialogue audio
- **Dialogue Scale:** 5,400 multi-turn dialogues, over 60,000 utterances
- **Speakers:** 113 individuals, balanced gender ratio, ages 18–50, covering major dialect regions across China
- **Dialogue Domains:** 58 task-oriented domains (e.g., dining, transportation, shopping, healthcare, finance), including 55 intents and 115 slots
- **Audio Specifications:** 16 kHz sampling rate, WAV format, recorded via both professional and mobile devices
- **Transcription & Annotation:**
  - Manually transcribed at the character level, preserving spoken language features
  - Annotated with 4 categories of disfluencies (elongation, repetition, self-correction, hesitation)
  - Includes transcriptions, slot values, intents, and speaker metadata (gender, age, region, etc.)

---

## Dataset Features

1. **Natural and Colloquial:** Contains spoken features and disfluencies in real task-oriented dialogues, overcoming the limitation of “read speech” corpora.
2. **Bimodal and Real Interaction:** Provides paired speech-text annotations and introduces a *cross-modal chat task*, supporting dynamic switching between speech and text, which is closer to real-world human-computer interaction.
3. **Complete Dialogues and Multi-Domain Coverage:** Average of 12 turns per dialogue, covering 58 real-world domains, supporting both single-domain and cross-domain dialogue modeling.
4. **Diverse Speakers:** Covers major regions in China, balanced across gender and age, enabling research on the impact of accents, dialects, and demographic differences.
5. **High-Quality Annotation and Strict Quality Control:** Multiple rounds of manual verification, detailed timestamps, and slot annotations ensure reliability and research value.

---

## Advantages

- The **first large-scale Chinese speech-text TOD corpus**, filling the gap in benchmark datasets for Chinese spoken dialogue.
- Provides **disfluency annotations**, supporting robustness evaluation and error correction research in speech-based TOD systems.
- Enables research in **speech recognition, speech synthesis, intent recognition, slot filling, dialogue management, and cross-modal studies**.
- Serves as a **benchmark for Speech LLMs in Chinese TOD tasks**, driving the development of advanced speech interaction systems.

---

## Conclusion

The release of **RealTalk-CN** lays the foundation for research in Chinese speech-text bimodal dialogue. With its **large scale, multi-domain coverage, natural spoken language, diverse speakers, and cross-modal interaction**, it not only advances the development of Speech LLMs in task-oriented dialogue but also provides a key resource for future cross-modal and multimodal intelligent systems.
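As a practical aside, the audio specification stated in the card (16 kHz sampling rate, WAV format) can be spot-checked on downloaded files before training or evaluation. The sketch below uses only Python's standard `wave` module and assumes the files are uncompressed PCM WAV; the function names and the synthetic test tone are illustrative and are not part of the dataset's tooling. The card does not state channel count or bit depth, so only the sample rate is compared.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

EXPECTED_SAMPLE_RATE = 16_000  # 16 kHz, as stated in the dataset card


def inspect_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def matches_card_spec(path):
    """True if the file's sample rate equals the 16 kHz stated in the card."""
    rate, _, _ = inspect_wav(path)
    return rate == EXPECTED_SAMPLE_RATE


def write_test_tone(path, seconds=0.5, rate=EXPECTED_SAMPLE_RATE):
    """Write a mono 16-bit sine tone so the check can run without the dataset."""
    n = int(seconds * rate)
    samples = (int(12000 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(n))
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)   # mono (an assumption for the demo file only)
        wav.setsampwidth(2)   # 16-bit samples (also an assumption)
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


if __name__ == "__main__":
    # Demo on a synthetic file; replace `demo` with a path to a real recording.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.wav"
        write_test_tone(demo)
        print(inspect_wav(demo), matches_card_spec(demo))
```

Summing the returned durations over all files gives a figure that can be compared against the roughly 150 hours reported above.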


Provider:

maas

Created:

2025-11-07
