mpasila/PIPPA-ShareGPT-formatted-named
收藏Hugging Face2024-04-15 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/mpasila/PIPPA-ShareGPT-formatted-named
下载链接
链接失效反馈官方服务:
资源简介:
---
license: agpl-3.0
task_categories:
- text-generation
language:
- en
tags:
- not-for-all-audiences
- conversational
- roleplay
- custom-format
- a.
pretty_name: PIPPA - Personal Interaction Pairs Between People and AI
size_categories:
- 10K<n<100K
viewer: false
---
This is modified version of [KaraKaraWitch/PIPPA-ShareGPT-formatted](https://huggingface.co/datasets/KaraKaraWitch/PIPPA-ShareGPT-formatted). I added randomized names for each conversation and moved the description of the character into a system message and some other cleanup.
The randomized names might be causing some problems like the bot using incorrect pronouns etc.
# Original dataset card:
# KaraKaraWitch/PIPPA-IHaveNeverFeltNeedToSend
```
I've never felt the need to send a photo of my <REDACTED>
To a stranger on the Internet
```
The following is the original description for PIPPA. [Consider downloading the original dataset over here!](https://huggingface.co/datasets/PygmalionAI/PIPPA)
---
# PIPPA - Personal Interaction Pairs between People and AI
It's been a long time coming, but we're proud to finally release the public portion of our conversational dataset to the public. **Personal Interaction Pairs between People and AI** (**PIPPA**) is a partially synthetic, community contributed and open-source conversational and roleplaying dataset generated from a subset of submitted logs to the Pygmalion project.
This dataset is a subset of what we have received - it consists only of the valid conversational logs in which the submitter gave consent to redistribute to the public. Furthermore, we have done our best to redact or modify any personal information that could potentially be found within PIPPA. If you have found something within PIPPA which has not been redacted properly, please contact us via. email at `teargosling@pygmalion.chat` or `alpindale@pygmalion.chat` and we'll take care of it for you. You may contact us for any other purpose as well, including yelling at us for when the next model will be released.
**⚠️ CAUTION: PIPPA contains conversations, themes and scenarios which can be considered "not safe for work" (NSFW) and/or heavily disturbing in nature. Models trained purely with PIPPA may have the tendency to generate X-rated output. You have been warned.**
## Dataset Summary
PIPPA consists of just a little more than 1 million lines of dialogue spread out over 26,000 conversations between users of the popular chatbot website "Character.AI" and its large language model, obtained through a large community effort taking place over the course of several months. Tallying shows that over 1,000 unique personas simulating both real and fictional characters are represented within the dataset, allowing PIPPA and LLMs fine-tuned on it to adapt to many different roleplay domains.
The dataset is represented with a JSONL file, with a singular JSON snippet representing one entire conversation. Every snippet contains the following pieces of data:
- `submission_timestamp`: The Unix timestamp of when this particular conversation was submitted to the project, in milliseconds.
- `categories`: The categories assigned to the character on the Character.AI website, if any were assigned. If no categories were assigned, it will be `null`
- `bot_id`: The unique ID assigned to the specific character which the user was conversing with on the website.
- `bot_name`: The name of the character.
- `bot_greeting`: The introductory line of the character to the user. This is always the first utterance of dialogue in a conversation.
- `bot_definitions`: Contains whatever was typed in the **Definitions** field in the character creator on the website. This usually consists of one or more example conversations between the user and the character designed to steer the model towards emulating the persona correctly. Bot definitions required a separate effort to gather, and thus may not be present for a specific persona - if this is the case, an empty string is provided. Because the defintions were written on Character.AI, this field usually follows Character.AI's unique formatting and should be preprocessed before feeding into any model - please see **Appendix A** of the paper for further details.
- `bot_description`: Contains whatever was typed in the **Description** field in the character creator on the website. It usually consists of a few sentences which gives a brief overview of the character and any important details about them.
- `conversation`: The conversation between the user and the model. This is represented as a list of dictionaries, each dictionary representing a single utterance and containing two key-value pairs: `message`, referring to the utterance itself and `is_human`, which designates whether the dialogue was generated by the user or the LLM.
For further information about PIPPA, please refer to our [published paper](https://arxiv.org/abs/2308.05884) or contact us at the emails listed above.
## Files
We publish PIPPA in multiple variants, each a singular JSONL file:
- **pippa.jsonl**: The original dataset, almost exactly as submitted to us (barring any modifications resulting from the redaction of personally identifiable information).
- **pippa_deduped.jsonl**: The 'cleaned' version of PIPPA, with duplicate conversations as well as any conversation with less than three turns removed from the dataset. **We recommend using this file.**
- **pippa_metharme.jsonl**: A version of deduped PIPPA which is formatted in a similar way to our [Metharme instructional models](https://huggingface.co/PygmalionAI/metharme-13b), useful as an example to demonstrate how to properly format the PIPPA dataset.
If you are using HuggingFace's `datasets` library, you can choose the file you wish to use by specifying the name of it (without extension) as an argument, like so: `dataset = load_dataset("PygmalionAI/PIPPA", 'pippa_deduped')`. The default value is `pippa_deduped`.
Thank you for your patience, everyone!
## Citation
If you're using our dataset, please consider citing our work:
```bibtex
@misc{gosling2023pippa,
title={PIPPA: A Partially Synthetic Conversational Dataset},
author={Tear Gosling and Alpin Dale and Yinhe Zheng},
year={2023},
eprint={2308.05884},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
___
Any relationship between the name of this dataset and any public personas is entirely and totally coincidential.
提供机构:
mpasila
原始信息汇总
PIPPA - Personal Interaction Pairs between People and AI
数据集概述
PIPPA(Personal Interaction Pairs between People and AI)是一个部分合成、社区贡献的开放源代码对话和角色扮演数据集,源自Pygmalion项目提交日志的一个子集。该数据集仅包含提交者同意重新分发的有效对话日志,并已尽力删除或修改任何可能包含的个人识别信息。
数据集内容
PIPPA包含超过100万行对话,分布在26,000个对话中,涉及用户与流行聊天机器人网站“Character.AI”及其大型语言模型之间的交流。数据集中有超过1,000个独特的角色,包括真实和虚构角色,使PIPPA和基于其微调的LLM能够适应多种角色扮演领域。
数据格式
数据集以JSONL文件格式表示,每个JSON片段代表一个完整的对话,包含以下数据:
submission_timestamp:对话提交到项目的Unix时间戳(以毫秒为单位)。categories:在Character.AI网站上分配给角色的类别(如果有)。bot_id:分配给网站上与用户交谈的特定角色的唯一ID。bot_name:角色的名称。bot_greeting:角色对用户的开场白,这是对话中的第一个发言。bot_definitions:在角色创建者中Definitions字段中输入的内容,通常包含用户和角色之间的示例对话,以正确引导模型模拟角色。bot_description:在角色创建者中Description字段中输入的内容,通常包含对角色的简要概述和重要细节。conversation:用户和模型之间的对话,表示为字典列表,每个字典代表一个发言,包含message(发言本身)和is_human(指定发言是由用户还是LLM生成)。
文件版本
数据集发布为多个JSONL文件:
- pippa.jsonl:原始数据集,几乎完全按照提交给我们的方式(除了个人识别信息的修改)。
- pippa_deduped.jsonl:经过清理的PIPPA版本,删除了重复对话和少于三回合的对话。推荐使用此文件。
- pippa_metharme.jsonl:格式类似于Metharme instructional models的deduped PIPPA版本,可用作正确格式化PIPPA数据集的示例。
注意事项
PIPPA包含可能被视为“不适合工作场所”(NSFW)和/或性质严重令人不安的对话、主题和场景。使用纯PIPPA训练的模型可能会生成成人内容输出。
引用
如果使用此数据集,请考虑引用我们的工作: bibtex @misc{gosling2023pippa, title={PIPPA: A Partially Synthetic Conversational Dataset}, author={Tear Gosling and Alpin Dale and Yinhe Zheng}, year={2023}, eprint={2308.05884}, archivePrefix={arXiv}, primaryClass={cs.CL} }
搜集汇总
数据集介绍

构建方式
在对话生成领域,构建高质量数据集是推动模型泛化能力的关键。PIPPA数据集通过社区贡献与部分合成的方式精心构建,其原始对话来源于Character.AI网站的用户交互日志。数据收集过程中,仅选取提交者同意公开且经过有效性验证的对话,并严格进行了个人信息脱敏处理,确保隐私安全。数据集以JSONL格式组织,每条记录代表一次完整对话,包含时间戳、角色类别、角色定义及对话序列等结构化字段,为后续模型训练提供了丰富且规范的语料基础。
特点
作为面向角色扮演与开放域对话的数据集,PIPPA展现出鲜明的多维度特征。其涵盖超过26,000次对话与100万行文本,涉及千余种独特角色,覆盖真实与虚构人物,支持多样化的情境模拟。数据集特别标注了角色描述、问候语及定义字段,这些元数据有助于模型深入理解角色背景与对话语境。值得注意的是,数据内容包含成人向主题,使用时需谨慎评估其适用场景。此外,数据集提供去重版本与格式化变体,增强了数据的洁净度与易用性。
使用方法
在自然语言处理研究中,PIPPA数据集为对话模型的训练与评估提供了实用资源。用户可通过HuggingFace的datasets库直接加载,并选择原始版、去重版或预格式化版本以适应不同实验需求。数据中的对话序列以人类与AI交替发言的形式呈现,每条发言均标注发言者身份,便于构建有监督的对话生成任务。使用前建议参考附录中的格式说明,对角色定义等字段进行必要预处理,以优化模型输入。该数据集适用于角色扮演对话生成、对话系统适应性训练等研究方向,但需注意其内容敏感性,确保符合伦理规范。
背景与挑战
背景概述
在人工智能对话系统快速发展的背景下,PIPPA数据集于2023年由PygmalionAI团队发布,旨在提供高质量的人机交互对话数据。该数据集源自Character.AI平台用户提交的对话日志,经过社区贡献与部分合成处理,涵盖了超过26,000个对话和1000多个独特角色,专注于角色扮演与开放域对话生成任务。其核心研究问题在于如何利用真实用户交互数据提升大型语言模型在个性化对话模拟中的表现,对推动自然语言处理领域中的对话生成与角色适应性研究具有显著影响力。
当前挑战
PIPPA数据集面临的挑战主要集中于两个方面:在领域问题层面,它旨在解决开放域角色扮演对话生成的复杂性,包括如何准确模拟多样化角色的人格特质、维持对话连贯性以及处理潜在的不安全内容,这对模型的泛化与安全控制能力提出了较高要求;在构建过程中,挑战涉及大规模对话数据的收集与清理,例如确保用户隐私信息的有效脱敏、去除重复或低质量对话,以及处理数据格式不一致问题,如原始Character.AI平台特有的定义字段格式化需求,这些步骤增加了数据集构建的复杂性与可靠性风险。
常用场景
经典使用场景
在自然语言处理领域,PIPPA数据集以其丰富的角色扮演对话内容,为大型语言模型的微调提供了关键资源。该数据集通过模拟用户与AI角色之间的互动,涵盖了多样化的对话场景和人物设定,使得研究者能够利用这些数据训练模型以生成更具上下文适应性和角色一致性的回复。这种应用不仅提升了模型在开放域对话中的表现,还为个性化交互系统的开发奠定了基础。
实际应用
在实际应用中,PIPPA数据集被广泛用于增强聊天机器人和虚拟助手的交互能力。例如,在娱乐和教育领域,基于该数据训练的模型可以模拟特定角色或历史人物,提供沉浸式的对话体验。同时,企业也可利用这些数据优化客户服务系统,使其能够处理更复杂、个性化的用户查询,提升用户体验。
衍生相关工作
围绕PIPPA数据集,已衍生出多项经典研究工作,包括基于其微调的开源对话模型,如PygmalionAI系列。这些工作进一步探索了数据清洗、格式转换和伦理处理技术,例如创建去重版本和Metharme格式变体。相关研究还扩展至多模态对话生成,推动了合成数据在AI训练中的标准化应用。
以上内容由遇见数据集搜集并总结生成



