
google/Synthetic-Persona-Chat

Source: Hugging Face. Updated 2024-03-01. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/google/Synthetic-Persona-Chat
Resource description:
---
license: cc-by-4.0
task_categories:
- text2text-generation
language:
- en
size_categories:
- 10K<n<100K
---

# Dataset Card for SPC: Synthetic-Persona-Chat Dataset

Abstract from the paper introducing this dataset:

> High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during Turing test decreases from 17.2% to 8.8% over three iterations.

## Dataset Details

### Dataset Description

> We introduce the Synthetic-Persona-Chat dataset, a persona-based conversational dataset, consisting of two parts.
> The first part, consisting of 4,723 personas and 10,906 conversations, is an extension to Persona-Chat, which has the same user profile pairs as Persona-Chat but new synthetic conversations, with the same train/validation/test split as Persona-Chat. The second part is new synthetic personas and synthetic conversations based on that, consisting of 5,648 synthetic personas and 11,001 conversations. Synthetic-Persona-Chat is created using the Generator-Critic framework introduced in Faithful Persona-based Conversational Dataset Generation with Large Language Models.

Each conversation in the dataset has the following format:

```
{
  "User 1 Persona":[],
  "User 2 Persona":[],
  "Conversation":[]
}
```

### Dataset Sources

- **Repository:** https://github.com/google-research-datasets/Synthetic-Persona-Chat/tree/main
- **Paper:** https://arxiv.org/abs/2312.10007

## Citation

**BibTeX:**

```
@misc{jandaghi2023faithful,
  title={Faithful Persona-based Conversational Dataset Generation with Large Language Models},
  author={Pegah Jandaghi and XiangHai Sheng and Xinyi Bai and Jay Pujara and Hakim Sidahmed},
  year={2023},
  eprint={2312.10007},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
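
The record format given in the dataset card (three keys, each holding a list) can be checked with a minimal Python sketch like the one below. The card does not say what the list elements look like, so treating them as plain strings is an assumption, and `PersonaChatRecord`, `parse_record`, and the sample record are hypothetical names and invented data used only for illustration; they are not part of the dataset or its tooling.

```python
import json
from dataclasses import dataclass
from typing import List

# The three keys listed in the dataset card's format description.
REQUIRED_KEYS = ("User 1 Persona", "User 2 Persona", "Conversation")


@dataclass
class PersonaChatRecord:
    """Hypothetical container for one conversation record."""
    user1_persona: List[str]  # assumption: persona statements are strings
    user2_persona: List[str]
    conversation: List[str]   # assumption: each turn is a string


def parse_record(raw: str) -> PersonaChatRecord:
    """Parse one JSON object and verify it has the card's three list fields."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    for key in REQUIRED_KEYS:
        if not isinstance(data[key], list):
            raise ValueError(f"{key!r} must be a list")
    return PersonaChatRecord(
        user1_persona=list(data["User 1 Persona"]),
        user2_persona=list(data["User 2 Persona"]),
        conversation=list(data["Conversation"]),
    )


# Invented sample record, NOT taken from the dataset.
sample = json.dumps({
    "User 1 Persona": ["I like hiking."],
    "User 2 Persona": ["I play the piano."],
    "Conversation": ["Hi, how are you?", "Doing well, thanks!"],
})

record = parse_record(sample)
print(len(record.conversation), "turns")
```

The actual files in the repository may store these fields differently (for example as CSV columns or newline-joined strings), so the validation step would need adjusting after inspecting the real data.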
Provider:
google
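Summary of original information

Dataset overview

Dataset source

  • This dataset originates from an academic paper.

Dataset introduction

  • A detailed description of the dataset can be found in the paper that introduced it.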