
Synthetic-Persona-Chat

ModelScope community (魔搭社区). Updated 2026-05-22; indexed 2025-04-26.
Download link:
https://modelscope.cn/datasets/google/Synthetic-Persona-Chat
Resource description:
# Dataset Card for SPC: Synthetic-Persona-Chat Dataset

Abstract from the paper introducing this dataset:

> High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during Turing test decreases from 17.2% to 8.8% over three iterations.

## Dataset Details

### Dataset Description

> We introduce the Synthetic-Persona-Chat dataset, a persona-based conversational dataset, consisting of two parts. The first part, consisting of 4,723 personas and 10,906 conversations, is an extension to Persona-Chat, which has the same user profile pairs as Persona-Chat but new synthetic conversations, with the same train/validation/test split as Persona-Chat. The second part is new synthetic personas and synthetic conversations based on that, consisting of 5,648 synthetic personas and 11,001 conversations. Synthetic-Persona-Chat is created using the Generator-Critic framework introduced in Faithful Persona-based Conversational Dataset Generation with Large Language Models.

Each conversation in the dataset has the following format:

```
{
"User 1 Persona":[],
"User 2 Persona":[],
"Conversation":[]
}
```

### Dataset Sources

- **Repository:** https://github.com/google-research-datasets/Synthetic-Persona-Chat/tree/main
- **Paper:** https://arxiv.org/abs/2312.10007

## Citation

**BibTeX:**

```
@misc{jandaghi2023faithful,
      title={Faithful Persona-based Conversational Dataset Generation with Large Language Models},
      author={Pegah Jandaghi and XiangHai Sheng and Xinyi Bai and Jay Pujara and Hakim Sidahmed},
      year={2023},
      eprint={2312.10007},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
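
The record format given in the Dataset Description can be checked with a few lines of Python. The sketch below is illustrative only: it assumes a record is available as a JSON string and that each field holds a list of strings. The card shows only empty lists and does not say how the files are stored on disk. The helper `parse_record` and the sample record are hypothetical and are not taken from the dataset or its repository.

```python
import json

# Field names exactly as listed in the dataset card's format description.
REQUIRED_KEYS = ("User 1 Persona", "User 2 Persona", "Conversation")


def parse_record(raw: str) -> dict:
    """Parse one JSON record and verify the three list-valued fields exist.

    Hypothetical helper: the card specifies the field names only, so the
    JSON-string input is an assumption made for this sketch.
    """
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    for key in REQUIRED_KEYS:
        if not isinstance(record.get(key), list):
            raise ValueError(f"missing or non-list field: {key!r}")
    return record


# Invented sample record, for illustration only (not real dataset content).
sample = json.dumps({
    "User 1 Persona": ["I like hiking."],
    "User 2 Persona": ["I play the violin."],
    "Conversation": ["User 1: Hi!", "User 2: Hello, how are you?"],
})

record = parse_record(sample)
print(len(record["Conversation"]), "utterances")
```

Validating the three keys up front makes a malformed record fail with a clear error at load time instead of surfacing later in a training pipeline.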

Provider:
maas
Created:
2025-04-21