Yifanfan/Persona-Dialogue

Name: Yifanfan/Persona-Dialogue
Creator: Yifanfan
Published: 2026-04-02 02:34:17
License: CC BY 4.0 (as declared in the dataset card below; the listing page itself gives no description)

Source: Hugging Face. Updated 2026-04-02; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Yifanfan/Persona-Dialogue


Description:

---
license: cc-by-4.0
language:
- en
tags:
- audio
- dialogue
- multi-turn
- tts
- persona
size_categories:
- 10K<n<100K
---

# Persona-Dialogue Dataset

Multi-turn persona-driven dialogue dataset with synthesized speech audio.

## Overview

- **Total conversations**: 21561
- **Total turns**: 165871
- **Total audio duration**: 498.0 hours
- **Audio format**: WAV, mono, 24kHz
- **Language**: English
- **Scenarios**: 20

## Scenarios

| Scenario | Groups |
|----------|--------|
| Family life | 7497 |
| School classroom | 2115 |
| Company meeting | 1776 |
| Restaurant | 1201 |
| Travel group | 983 |
| Friends gathering | 963 |
| Library/Bookstore | 962 |
| Stadium/Sports game | 915 |
| Shopping center | 898 |
| Concert/Music festival | 889 |
| Technology exhibition | 360 |
| Gym | 357 |
| Art gallery | 353 |
| Cafe | 348 |
| Sports club | 332 |
| Public transportation | 331 |
| Park | 328 |
| Amusement park | 327 |
| Hospital | 318 |
| Pet shop | 308 |

## Per-Server Breakdown

| Server | Groups | Turns | Duration |
|--------|--------|-------|----------|
| img73 | 7167 | 55373 | 162.0h |
| img75 | 5261 | 40546 | 123.4h |
| img77 | 2327 | 17603 | 57.5h |
| img90 | 6806 | 52349 | 155.0h |

## Data Structure

Audio is stored as tar archives under `shards/{server}/tars/`. Each tar contains `audio/{server}/{group_id}/*.wav` preserving the original directory structure.
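
The path layout described above can be sketched as a small helper that splits an audio path into its components and derives the directory holding that server's tar archives. This is only an illustration: the `AudioRef` type, the `parse_audio_path` function, and the example group ID and file name are hypothetical and not part of the dataset. Which tar inside the directory holds a given group is not specified here, so the helper stops at the directory.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class AudioRef(NamedTuple):
    """Components of an audio path (hypothetical helper type)."""
    server: str
    group_id: str
    filename: str
    tar_dir: str  # directory that holds this server's tar archives


def parse_audio_path(audio_path: str) -> AudioRef:
    """Split 'audio/{server}/{group_id}/{file}.wav' into its components."""
    parts = PurePosixPath(audio_path).parts
    if len(parts) != 4 or parts[0] != "audio" or not parts[3].endswith(".wav"):
        raise ValueError(f"unexpected audio path layout: {audio_path!r}")
    _, server, group_id, filename = parts
    return AudioRef(server, group_id, filename, f"shards/{server}/tars")


# Assumption: the group ID and file name below are made up for illustration.
ref = parse_audio_path("audio/img73/group_0001/turn_01.wav")
print(ref.server, ref.group_id, ref.filename, ref.tar_dir)
```

Rejecting paths that do not match the four-part layout keeps a malformed entry from silently mapping to the wrong server directory.
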
### Turn-level fields (`all_turns.jsonl`)

| Field | Description |
|-------|-------------|
| `id` | Unique turn ID |
| `conversation_id` | Unique conversation ID |
| `turn_id` | 1-indexed turn number |
| `scenario` | Dialogue scenario |
| `topic` | Conversation topic |
| `speaker` | Speaker name |
| `role` | `user` or `assistant` |
| `text` | Utterance text |
| `audio` | Path to WAV inside tar: `audio/{server}/{group}/{file}.wav` |
| `source_server` | Source server ID |

### Group-level fields (`all_groups.jsonl`)

| Field | Description |
|-------|-------------|
| `conversation_id` | Unique conversation ID |
| `scenario` | Dialogue scenario |
| `topic` | Conversation topic |
| `num_turns` | Number of turns |
| `duration_s` | Total audio duration (seconds) |
| `profiles` | Speaker persona profiles |
| `dialogue` | Full dialogue |
| `audio_paths` | List of audio paths inside tar |

## Generation Pipeline

Dialogues were generated by an LLM with persona profiles and synthesized using Qwen3-TTS. Quality was validated through ASR (WER < 0.2), speaker similarity (> 0.35), and faithfulness and relevance checks.

## Extracting Audio

```python
import tarfile, json

# List all tar files for a server
with open("shards/img73/tar_manifest.json") as f:
    manifest = json.load(f)

# Extract a specific tar
with tarfile.open("shards/img73/tars/img73_family_life_part01.tar") as tf:
    tf.extractall("./extracted/")
```
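
As an alternative to extracting a whole archive, a single WAV can be read straight from a tar using its in-archive path. The sketch below is a minimal illustration, not part of the dataset tooling: `wav_info_from_tar` and `_make_demo_tar` are hypothetical helpers, the demo builds a small synthetic tar so the code runs without the dataset, and it assumes the files are PCM-encoded WAV, which is what Python's `wave` module reads.

```python
import io
import os
import tarfile
import tempfile
import wave


def wav_info_from_tar(tar_path: str, member_name: str) -> dict:
    """Read one WAV member from a tar and return basic header information."""
    with tarfile.open(tar_path) as tf:
        fileobj = tf.extractfile(member_name)  # KeyError if the member is missing
        if fileobj is None:
            raise ValueError(f"{member_name!r} is not a regular file")
        data = fileobj.read()
    with wave.open(io.BytesIO(data), "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "duration_s": wf.getnframes() / rate,
        }


def _make_demo_tar(tar_path: str, member_name: str,
                   seconds: float = 0.5, rate: int = 24000) -> None:
    """Write a tar holding one silent mono 16-bit WAV (demo data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    payload = buf.getvalue()
    entry = tarfile.TarInfo(name=member_name)
    entry.size = len(payload)
    with tarfile.open(tar_path, "w") as tf:
        tf.addfile(entry, io.BytesIO(payload))


# Demo on synthetic data; the member name below is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = os.path.join(tmp, "demo.tar")
    member = "audio/img73/group_0001/turn_01.wav"
    _make_demo_tar(demo_tar, member)
    info = wav_info_from_tar(demo_tar, member)
    print(info)
```

With the real data, `tar_path` would point at a file under `shards/{server}/tars/` and `member_name` would be a turn's `audio` value.
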


Provider: Yifanfan
