Synthetic-Persona-Chat
Source: ModelScope community (updated 2026-05-22, indexed 2025-04-26)
Download link: https://modelscope.cn/datasets/google/Synthetic-Persona-Chat
# Dataset Card for SPC: Synthetic-Persona-Chat Dataset
Abstract from the paper introducing this dataset:
> High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during Turing test decreases from 17.2% to 8.8% over three iterations.
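
The Generator-Critic loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_generator`, `critic_select`, `run`, and the two scoring functions are hypothetical stand-ins, the experts are plain Python functions rather than LLMs, and combining experts by summing their scores is an assumption. Feeding the selected conversation back as an example for the next round is also only one possible reading of "use to improve the Generator".

```python
from typing import Callable, List, Sequence

Conversation = List[str]
Expert = Callable[[Conversation], float]


def toy_generator(persona: str, examples: Sequence[Conversation], n: int) -> List[Conversation]:
    """Stand-in for the prompted LLM: returns n candidates with 1..n turns.

    `examples` is accepted but ignored by this stub; a real Generator would
    presumably condition on the examples (assumption).
    """
    return [[f"turn {t} about {persona}" for t in range(1, k + 1)] for k in range(1, n + 1)]


def critic_select(candidates: Sequence[Conversation], experts: Sequence[Expert]) -> Conversation:
    """Return the candidate with the highest summed expert score (assumed rule)."""
    return max(candidates, key=lambda c: sum(expert(c) for expert in experts))


def run(persona: str, experts: Sequence[Expert], iterations: int = 3, n: int = 4) -> List[Conversation]:
    """Run the generate -> select -> feed back loop for a fixed number of iterations."""
    examples: List[Conversation] = []
    for _ in range(iterations):
        candidates = toy_generator(persona, examples, n)
        best = critic_select(candidates, experts)
        examples.append(best)  # assumed feedback mechanism
    return examples


# Two hypothetical experts: one rewards length, one penalizes more than 3 turns.
experts = [lambda c: float(len(c)), lambda c: -10.0 if len(c) > 3 else 0.0]
selected = run("hiking", experts)
print(len(selected), len(selected[0]))  # prints "3 3"
```

The default of three iterations mirrors the three iterations mentioned in the abstract; everything else is illustrative.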
## Dataset Details
### Dataset Description
> We introduce the Synthetic-Persona-Chat dataset, a persona-based conversational dataset, consisting of two parts. The first part, consisting of 4,723 personas and 10,906 conversations, is an extension to Persona-Chat, which has the same user profile pairs as Persona-Chat but new synthetic conversations, with the same train/validation/test split as Persona-Chat. The second part is new synthetic personas and synthetic conversations based on that, consisting of 5,648 synthetic personas and 11,001 conversations. Synthetic-Persona-Chat is created using the Generator-Critic framework introduced in Faithful Persona-based Conversational Dataset Generation with Large Language Models.
Each conversation in the dataset has the following format:
```
{
"User 1 Persona":[],
"User 2 Persona":[],
"Conversation":[]
}
```
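As a minimal sketch of how a record in this format might be checked after loading, the snippet below parses one JSON object and verifies that the three fields are present and are lists. The helper name `parse_record` and the sample values are invented for illustration, and the files in the repository may be laid out differently from this single-object form.

```python
import json

# Field names taken from the format shown in the dataset card.
REQUIRED_KEYS = ("User 1 Persona", "User 2 Persona", "Conversation")


def parse_record(raw: str) -> dict:
    """Parse one JSON record and check that each required field is a list."""
    record = json.loads(raw)
    for key in REQUIRED_KEYS:
        if key not in record:
            raise KeyError(f"missing field: {key}")
        if not isinstance(record[key], list):
            raise TypeError(f"field {key!r} should be a list")
    return record


# Hypothetical sample; the values are invented, and string elements are an assumption.
sample = json.dumps({
    "User 1 Persona": ["I like hiking."],
    "User 2 Persona": ["I have two cats."],
    "Conversation": ["User 1: Hi!", "User 2: Hello!"],
})

record = parse_record(sample)
print(len(record["Conversation"]))  # prints 2
```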
### Dataset Sources
- **Repository:** https://github.com/google-research-datasets/Synthetic-Persona-Chat/tree/main
- **Paper:** https://arxiv.org/abs/2312.10007
## Citation
**BibTeX:**
```
@misc{jandaghi2023faithful,
title={Faithful Persona-based Conversational Dataset Generation with Large Language Models},
author={Pegah Jandaghi and XiangHai Sheng and Xinyi Bai and Jay Pujara and Hakim Sidahmed},
year={2023},
eprint={2312.10007},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider: maas
Created: 2025-04-21



