Yifanfan/Persona-Dialogue
Source: Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Yifanfan/Persona-Dialogue
---
license: cc-by-4.0
language:
- en
tags:
- audio
- dialogue
- multi-turn
- tts
- persona
size_categories:
- 10K<n<100K
---
# Persona-Dialogue Dataset
A multi-turn, persona-driven dialogue dataset with synthesized speech audio.
## Overview
- **Total conversations**: 21561
- **Total turns**: 165871
- **Total audio duration**: 498.0 hours
- **Audio format**: WAV, mono, 24 kHz
- **Language**: English
- **Scenarios**: 20
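
The totals above imply some per-conversation and per-turn averages that help when sizing batches or estimating storage. The sketch below only does arithmetic on the figures listed in this overview; the function names are illustrative and not part of the dataset.

```python
# Figures copied from the Overview list.
TOTAL_CONVERSATIONS = 21561
TOTAL_TURNS = 165871
TOTAL_HOURS = 498.0


def mean_turns_per_conversation() -> float:
    """Average number of turns in one conversation."""
    return TOTAL_TURNS / TOTAL_CONVERSATIONS


def mean_seconds_per_turn() -> float:
    """Average audio length of a single turn, in seconds."""
    return TOTAL_HOURS * 3600 / TOTAL_TURNS


def mean_seconds_per_conversation() -> float:
    """Average audio length of a whole conversation, in seconds."""
    return TOTAL_HOURS * 3600 / TOTAL_CONVERSATIONS


print(f"{mean_turns_per_conversation():.2f} turns per conversation")
print(f"{mean_seconds_per_turn():.2f} s per turn")
print(f"{mean_seconds_per_conversation():.2f} s per conversation")
```

These are means over the whole corpus; the distribution within individual scenarios is not stated here and may differ.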
## Scenarios
| Scenario | Groups |
|----------|--------|
| Family life | 7497 |
| School classroom | 2115 |
| Company meeting | 1776 |
| Restaurant | 1201 |
| Travel group | 983 |
| Friends gathering | 963 |
| Library/Bookstore | 962 |
| Stadium/Sports game | 915 |
| Shopping center | 898 |
| Concert/Music festival | 889 |
| Technology exhibition | 360 |
| Gym | 357 |
| Art gallery | 353 |
| Cafe | 348 |
| Sports club | 332 |
| Public transportation | 331 |
| Park | 328 |
| Amusement park | 327 |
| Hospital | 318 |
| Pet shop | 308 |
## Per-Server Breakdown
| Server | Groups | Turns | Duration |
|--------|--------|-------|----------|
| img73 | 7167 | 55373 | 162.0h |
| img75 | 5261 | 40546 | 123.4h |
| img77 | 2327 | 17603 | 57.5h |
| img90 | 6806 | 52349 | 155.0h |
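
As a quick sanity check, the per-server rows can be summed and compared with the overview totals. The sketch below copies the rows from the table; the `totals` helper is illustrative only.

```python
# Rows copied from the Per-Server Breakdown table: (groups, turns, hours).
SERVERS = {
    "img73": (7167, 55373, 162.0),
    "img75": (5261, 40546, 123.4),
    "img77": (2327, 17603, 57.5),
    "img90": (6806, 52349, 155.0),
}


def totals(rows):
    """Sum groups, turns, and hours across all servers."""
    groups = sum(g for g, _, _ in rows.values())
    turns = sum(t for _, t, _ in rows.values())
    hours = round(sum(h for _, _, h in rows.values()), 1)
    return groups, turns, hours


groups, turns, hours = totals(SERVERS)
print(groups, turns, hours)
```

Groups and turns match the overview exactly (21561 and 165871). The hours sum to 497.9 rather than 498.0, which is presumably a rounding effect in the per-server figures.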
## Data Structure
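Audio is stored as tar archives under `shards/{server}/tars/`. Each tar contains
`audio/{server}/{group_id}/*.wav`, preserving the original directory structure.
A group is one conversation: the group counts in the tables above sum to the 21561 total conversations.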
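
Because the in-tar path encodes the server and group, it can be split to find which shard directory to search. A minimal sketch, assuming paths always have exactly the four components shown above; `AudioRef`, `parse_audio_path`, and `tar_dir_for` are hypothetical helper names, and the example path is made up.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class AudioRef(NamedTuple):
    server: str
    group_id: str
    filename: str


def parse_audio_path(path: str) -> AudioRef:
    """Split 'audio/{server}/{group_id}/{file}.wav' into its components."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or parts[0] != "audio" or not parts[3].endswith(".wav"):
        raise ValueError(f"unexpected audio path: {path!r}")
    return AudioRef(server=parts[1], group_id=parts[2], filename=parts[3])


def tar_dir_for(ref: AudioRef) -> str:
    """Directory that holds the tar archives for this clip's server."""
    return f"shards/{ref.server}/tars/"


# Hypothetical path, for illustration only.
ref = parse_audio_path("audio/img73/group_0001/turn_01.wav")
print(ref.server, ref.group_id, ref.filename, tar_dir_for(ref))
```

Which tar inside that directory holds a given group is not derivable from the path alone; the per-server `tar_manifest.json` is the place to look for that mapping.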
### Turn-level fields (`all_turns.jsonl`)
| Field | Description |
|-------|-------------|
| `id` | Unique turn ID |
| `conversation_id` | Unique conversation ID |
| `turn_id` | 1-indexed turn number |
| `scenario` | Dialogue scenario |
| `topic` | Conversation topic |
| `speaker` | Speaker name |
| `role` | `user` or `assistant` |
| `text` | Utterance text |
| `audio` | Path to WAV inside tar: `audio/{server}/{group}/{file}.wav` |
| `source_server` | Source server ID |
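
Turn records can be regrouped into ordered conversations using `conversation_id` and `turn_id`. A minimal sketch, assuming `all_turns.jsonl` holds one JSON object per line with the fields listed above; `group_turns` is an illustrative helper, not something shipped with the dataset.

```python
import json
from collections import defaultdict


def group_turns(lines):
    """Map conversation_id -> list of turn dicts sorted by turn_id."""
    conversations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        turn = json.loads(line)
        conversations[turn["conversation_id"]].append(turn)
    for turns in conversations.values():
        turns.sort(key=lambda t: t["turn_id"])
    return dict(conversations)


# Usage (file location relative to the dataset root is an assumption):
# with open("all_turns.jsonl", encoding="utf-8") as f:
#     conversations = group_turns(f)
# for turn in next(iter(conversations.values())):
#     print(turn["turn_id"], turn["role"], turn["speaker"], turn["text"])
```

Streaming the file line by line keeps memory proportional to the parsed records rather than the raw file plus the parsed records.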
### Group-level fields (`all_groups.jsonl`)
| Field | Description |
|-------|-------------|
| `conversation_id` | Unique conversation ID |
| `scenario` | Dialogue scenario |
| `topic` | Conversation topic |
| `num_turns` | Number of turns |
| `duration_s` | Total audio duration (seconds) |
| `profiles` | Speaker persona profiles |
| `dialogue` | Full dialogue |
| `audio_paths` | List of audio paths inside tar |
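
Group records carry `scenario` and `duration_s`, so audio hours per scenario can be tallied without touching the audio. A minimal sketch, assuming `all_groups.jsonl` is one JSON object per line and that `duration_s` is numeric; `scenario_hours` is an illustrative helper.

```python
import json
from collections import defaultdict


def scenario_hours(lines):
    """Return {scenario: total audio hours} from group-level JSONL lines."""
    seconds = defaultdict(float)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        group = json.loads(line)
        seconds[group["scenario"]] += group["duration_s"]
    return {scenario: total / 3600 for scenario, total in seconds.items()}


# Usage (file location relative to the dataset root is an assumption):
# with open("all_groups.jsonl", encoding="utf-8") as f:
#     for scenario, hours in sorted(scenario_hours(f).items()):
#         print(f"{scenario}: {hours:.1f} h")
```

The exact spelling of `scenario` values in the file is not documented here, so filter on values read from the data rather than on the labels in the Scenarios table.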
## Generation Pipeline
Dialogues are generated by an LLM conditioned on persona profiles, then synthesized to speech with Qwen3-TTS.
Quality is validated with ASR (WER < 0.2), speaker similarity (> 0.35),
and faithfulness and relevance checks.
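
To make the ASR criterion concrete, the sketch below computes word error rate as word-level edit distance divided by reference length and applies the 0.2 threshold. It is not the pipeline's actual implementation: the ASR model, the text normalization (here just lowercasing and whitespace splitting), and the function names are all assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def passes_asr_check(reference: str, hypothesis: str, max_wer: float = 0.2) -> bool:
    """True when the transcript is within the stated WER threshold."""
    return word_error_rate(reference, hypothesis) < max_wer


# One substitution in six words: WER = 1/6, below the 0.2 threshold.
print(passes_asr_check("the cat sat on the mat", "the cat sat on a mat"))
```

The speaker-similarity threshold is not sketched because the similarity metric and embedding model are not specified.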
## Extracting Audio
```python
import json
import tarfile

# Load the manifest that lists all tar files for a server
with open("shards/img73/tar_manifest.json") as f:
    manifest = json.load(f)

# Extract a specific tar
with tarfile.open("shards/img73/tars/img73_family_life_part01.tar") as tf:
    tf.extractall("./extracted/")
```
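
When only a few clips are needed, a member can be read straight from the archive instead of extracting everything. A minimal sketch, assuming the WAV files are uncompressed PCM (which the standard-library `wave` module requires); `read_wav_info` is a hypothetical helper, and the member path in the usage comment is a placeholder.

```python
import io
import tarfile
import wave


def read_wav_info(tf: tarfile.TarFile, member_name: str) -> dict:
    """Read one WAV member from an open tar and report its basic format."""
    member = tf.extractfile(member_name)  # raises KeyError if the name is absent
    if member is None:
        raise FileNotFoundError(f"{member_name} is not a regular file")
    # Buffer the member in memory so the wave module can seek freely.
    with wave.open(io.BytesIO(member.read()), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


# Usage (the member path is a placeholder, take real ones from the `audio` field):
# with tarfile.open("shards/img73/tars/img73_family_life_part01.tar") as tf:
#     info = read_wav_info(tf, "audio/img73/<group_id>/<file>.wav")
#     print(info)  # the Overview suggests 1 channel at 24000 Hz
```

Reading members by name avoids writing thousands of small files to disk, at the cost of scanning the tar index on first access.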
Provider: Yifanfan



