Synthetic-Persona-Chat

Name: Synthetic-Persona-Chat
Creator: maas
Published: 2026-05-22 13:58:29
License: No description available

ModelScope Community · Updated 2026-05-22 · Indexed 2025-04-26

Download link:

https://modelscope.cn/datasets/google/Synthetic-Persona-Chat

Resource description:

# Dataset Card for SPC: Synthetic-Persona-Chat Dataset

Abstract from the paper introducing this dataset:

> High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during Turing test decreases from 17.2% to 8.8% over three iterations.

## Dataset Details

### Dataset Description

> We introduce the Synthetic-Persona-Chat dataset, a persona-based conversational dataset, consisting of two parts. The first part, consisting of 4,723 personas and 10,906 conversations, is an extension to Persona-Chat, which has the same user profile pairs as Persona-Chat but new synthetic conversations, with the same train/validation/test split as Persona-Chat. The second part is new synthetic personas and synthetic conversations based on that, consisting of 5,648 synthetic personas and 11,001 conversations. Synthetic-Persona-Chat is created using the Generator-Critic framework introduced in Faithful Persona-based Conversational Dataset Generation with Large Language Models.

Each conversation in the dataset has the following format:

```
{
  "User 1 Persona":[],
  "User 2 Persona":[],
  "Conversation":[]
}
```

### Dataset Sources

- **Repository:** https://github.com/google-research-datasets/Synthetic-Persona-Chat/tree/main
- **Paper:** https://arxiv.org/abs/2312.10007

## Citation

**BibTeX:**

```
@misc{jandaghi2023faithful,
  title={Faithful Persona-based Conversational Dataset Generation with Large Language Models},
  author={Pegah Jandaghi and XiangHai Sheng and Xinyi Bai and Jay Pujara and Hakim Sidahmed},
  year={2023},
  eprint={2312.10007},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
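
The record format given in the dataset description can be checked with a few lines of Python. This is a minimal sketch, not the dataset's official loader: the helper `parse_record` and the sample record are hypothetical. It also assumes each list holds plain strings, which the card does not specify (it only shows empty lists). The part sizes are copied from the description above.

```python
import json

# Field names exactly as they appear in the dataset card.
REQUIRED_KEYS = ("User 1 Persona", "User 2 Persona", "Conversation")

# (personas, conversations) per part, as stated in the dataset description.
PART_SIZES = {
    "persona_chat_extension": (4723, 10906),
    "new_synthetic_personas": (5648, 11001),
}


def parse_record(raw: str) -> dict:
    """Parse one JSON record and check that the three list-valued fields exist.

    Hypothetical helper: the card documents the field names, not a loader.
    """
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    for key in REQUIRED_KEYS:
        if not isinstance(record.get(key), list):
            raise ValueError(f"missing or non-list field: {key!r}")
    return record


# Made-up sample; string elements are an assumption, not taken from the card.
sample = json.dumps({
    "User 1 Persona": ["I like hiking."],
    "User 2 Persona": ["I work as a chef."],
    "Conversation": ["User 1: Hi there!", "User 2: Hello, how are you?"],
})

record = parse_record(sample)
total_conversations = sum(convs for _, convs in PART_SIZES.values())
print(len(record["Conversation"]), total_conversations)
```

Summing the two parts gives 21,907 conversations, consistent with the "20k conversations" figure quoted in the abstract.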

Provider:

maas

Created:

2025-04-21
