RealTalk-CN

ModelScope community | Updated 2026-05-16 | Indexed 2025-11-15
Download link:
https://modelscope.cn/datasets/BAAI/RealTalk-CN
Resource overview:
# RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

📌 **Resources:**

- [GitHub Repository](https://github.com/Summer-Enzhi/RealTalk)
- [arXiv Paper](https://arxiv.org/abs/2508.10015)

**RealTalk-CN** is the first large-scale, multi-domain, bimodal (speech-text) Chinese **Task-Oriented Dialogue (TOD)** dataset. All data come from real human-to-human conversations, specifically constructed to advance research on speech-based large language models (Speech LLMs). Existing TOD datasets are mostly text-based, lacking real speech, spontaneous disfluencies, and cross-modal interaction scenarios. RealTalk-CN achieves breakthroughs in these aspects, fully supporting Chinese speech dialogue modeling and evaluation.

The dataset is released under the **CC BY-NC-SA 4.0 license**, and can be freely used for non-commercial research.

---

## Dataset Composition

- **Total Duration:** ~150 hours of verified real human-to-human dialogue audio
- **Dialogue Scale:** 5,400 multi-turn dialogues, over 60,000 utterances
- **Speakers:** 113 individuals, balanced gender ratio, ages 18–50, covering major dialect regions across China
- **Dialogue Domains:** 58 task-oriented domains (e.g., dining, transportation, shopping, healthcare, finance), including 55 intents and 115 slots
- **Audio Specifications:** 16 kHz sampling rate, WAV format, recorded via both professional and mobile devices
- **Transcription & Annotation:**
  - Manually transcribed at the character level, preserving spoken language features
  - Annotated with 4 categories of disfluencies (elongation, repetition, self-correction, hesitation)
  - Includes transcriptions, slot values, intents, and speaker metadata (gender, age, region, etc.)

---

## Dataset Features

1. **Natural and Colloquial:** Contains spoken features and disfluencies in real task-oriented dialogues, overcoming the limitation of "read speech" corpora.
2. **Bimodal and Real Interaction:** Provides paired speech-text annotations and introduces a *cross-modal chat task*, supporting dynamic switching between speech and text, which is closer to real-world human-computer interaction.
3. **Complete Dialogues and Multi-Domain Coverage:** Average of 12 turns per dialogue, covering 58 real-world domains, supporting both single-domain and cross-domain dialogue modeling.
4. **Diverse Speakers:** Covers major regions in China, balanced across gender and age, enabling research on the impact of accents, dialects, and demographic differences.
5. **High-Quality Annotation and Strict Quality Control:** Multiple rounds of manual verification, detailed timestamps, and slot annotations ensure reliability and research value.

---

## Advantages

- The **first large-scale Chinese speech-text TOD corpus**, filling the gap in benchmark datasets for Chinese spoken dialogue.
- Provides **disfluency annotations**, supporting robustness evaluation and error correction research in speech-based TOD systems.
- Enables research in **speech recognition, speech synthesis, intent recognition, slot filling, dialogue management, and cross-modal studies**.
- Serves as a **benchmark for Speech LLMs in Chinese TOD tasks**, driving the development of advanced speech interaction systems.

---

## Conclusion

The release of **RealTalk-CN** lays the foundation for research in Chinese speech-text bimodal dialogue. With its **large scale, multi-domain coverage, natural spoken language, diverse speakers, and cross-modal interaction**, it not only advances the development of Speech LLMs in task-oriented dialogue but also provides a key resource for future cross-modal and multimodal intelligent systems.
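The audio specification listed under Dataset Composition (16 kHz sampling rate, WAV format) can be checked on a downloaded file with Python's standard `wave` module. The following is a minimal sketch, not part of the official tooling: the function names are hypothetical, the card does not state channel count or bit depth (the mono 16-bit tone below is only a stand-in so the example runs without the dataset), and `wave` reads PCM-encoded WAV files only.

```python
import math
import os
import struct
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16_000  # 16 kHz, as stated in the dataset card


def describe_wav(path):
    """Return basic header information for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }


def matches_card_spec(info):
    """True when the file uses the 16 kHz sampling rate the card lists."""
    return info["sample_rate"] == EXPECTED_SAMPLE_RATE


def write_demo_wav(path, seconds=1.0, rate=EXPECTED_SAMPLE_RATE):
    """Write a synthetic sine tone (assumed mono, 16-bit) as a stand-in file."""
    n = int(seconds * rate)
    samples = (int(12000 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(n))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # assumption: the card does not state channel count
        wav.setsampwidth(2)   # assumption: the card does not state bit depth
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


def run_demo():
    """Create the stand-in file in a temporary directory and inspect it."""
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.wav")
        write_demo_wav(demo_path)
        return describe_wav(demo_path)


if __name__ == "__main__":
    info = run_demo()
    print(info, matches_card_spec(info))
```

To check real data, call `describe_wav` on the path of a downloaded audio file instead of the synthetic tone. The file layout of the dataset is not described in the card, so the path is left to the reader.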

# RealTalk-CN:一款支持跨模态交互分析的真实中文语音-文本对话基准数据集 📌 **资源:** - [GitHub仓库](https://github.com/Summer-Enzhi/RealTalk) - [Arxiv论文](https://arxiv.org/abs/2508.10015) **RealTalk-CN** 是首个大规模、多领域、双模态(语音-文本)的中文**任务型对话(Task-Oriented Dialogue, TOD)**数据集。所有数据均来自真实的人际对话,专为推动基于语音的大语言模型(Speech LLMs)的相关研究而构建。现有的任务型对话数据集多为文本型,缺乏真实语音、自然口语不流畅现象以及跨模态交互场景。RealTalk-CN在上述方面实现了突破,可全面支撑中文语音对话建模与评测工作。 本数据集采用**CC BY-NC-SA 4.0许可协议**发布,可免费用于非商业性研究。 --- ## 数据集构成 - **总时长:** 约150小时经过验证的真实人际对话音频 - **对话规模:** 5400轮多轮对话,超60000条话语 - **说话人:** 113名个体,性别比例均衡,年龄覆盖18至50岁,涵盖中国主要方言区域 - **对话领域:** 58个任务型对话领域(如餐饮、交通、购物、医疗、金融),包含55个意图与115个槽位 - **音频规格:** 16kHz采样率,WAV格式,通过专业设备与移动设备录制 - **转录与标注:** - 基于字符级手动转录,保留口语化语言特征 - 标注了4类语音不流畅现象(延长音、重复、自我修正、犹豫) - 包含转录文本、槽位值、意图以及说话人元数据(性别、年龄、地域等) --- ## 数据集特性 1. **自然口语化:** 包含真实任务型对话中的口语特征与不流畅现象,突破了“朗读式语音”语料库的局限。 2. **双模态真实交互:** 提供配对的语音-文本标注,并引入*跨模态对话任务*,支持语音与文本间的动态切换,更贴近真实人机交互场景。 3. **完整对话与多领域覆盖:** 平均每段对话含12轮,覆盖58个真实世界领域,可支撑单领域与跨领域对话建模研究。 4. **多样化说话人:** 覆盖中国主要地域,性别与年龄分布均衡,可用于研究口音、方言及人口统计学差异的影响。 5. **高质量标注与严格质控:** 经过多轮人工验证、详细时间戳与槽位标注,确保数据集的可靠性与研究价值。 --- ## 优势 - **首个大规模中文语音-文本任务型对话语料库**,填补了中文口语对话基准数据集的空白。 - 提供**语音不流畅现象标注**,可支撑基于语音的任务型对话系统的鲁棒性评测与错误校正研究。 - 可用于**语音识别、语音合成、意图识别、槽位填充、对话管理以及跨模态研究**等方向。 - 可作为**中文任务型对话场景下语音大语言模型的评测基准**,推动先进语音交互系统的发展。 --- ## 结语 **RealTalk-CN**的发布为中文语音-文本双模态对话研究奠定了基础。凭借其**大规模、多领域覆盖、自然口语化、多样化说话人以及跨模态交互**等特性,本数据集不仅推动了任务型对话场景下语音大语言模型的发展,更为未来跨模态与多模态智能系统提供了关键资源。
Provider:
maas
Created:
2025-11-07
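Background overview
RealTalk-CN is a large-scale Chinese speech-text bimodal dialogue dataset containing about 150 hours of real human conversation audio with corresponding text, covering 58 task-oriented dialogue domains. It features natural spoken language, diverse speakers, and high-quality annotation, supports research directions such as speech recognition and speech synthesis, and fills the gap in benchmark datasets for Chinese spoken dialogue.
The above content was collected and summarized by 遇见数据集.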
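As a rough sanity check on the scale figures in the dataset card (about 150 hours, 5,400 dialogues, more than 60,000 utterances), the per-dialogue and per-utterance averages can be derived with simple arithmetic. This is only a sketch: the function name is hypothetical, the inputs are the rounded totals from the card, and because the utterance count is given as a lower bound, the results are estimates rather than official statistics.

```python
def derive_averages(total_hours, dialogues, utterances):
    """Derive rough averages from the rounded totals stated in the card."""
    return {
        # "over 60,000" utterances, so this is a lower bound
        "utterances_per_dialogue": utterances / dialogues,
        # total audio divided by utterances, so pauses are included (upper bound)
        "seconds_per_utterance": total_hours * 3600 / utterances,
        "minutes_per_dialogue": total_hours * 60 / dialogues,
    }


if __name__ == "__main__":
    stats = derive_averages(total_hours=150, dialogues=5_400, utterances=60_000)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

With these inputs, the estimate is at least about 11.1 utterances per dialogue, at most about 9 seconds of audio per utterance, and roughly 1.7 minutes per dialogue. If one "turn" corresponds to one utterance (an assumption the card does not state), the reported average of 12 turns per dialogue would give 12 × 5,400 = 64,800 utterances, which is consistent with "over 60,000".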