PIPPA

ModelScope community, updated 2026-01-10, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/PygmalionAI/PIPPA
Resource overview:
# PIPPA - Personal Interaction Pairs between People and AI

It's been a long time coming, but we're proud to finally release the public portion of our conversational dataset.

**Personal Interaction Pairs between People and AI** (**PIPPA**) is a partially synthetic, community-contributed, open-source conversational and roleplaying dataset generated from a subset of the logs submitted to the Pygmalion project.

This dataset is a subset of what we have received: it consists only of the valid conversational logs whose submitters consented to public redistribution. Furthermore, we have done our best to redact or modify any personal information that could potentially be found within PIPPA. If you find something within PIPPA that has not been redacted properly, please contact us via email at `teargosling@pygmalion.chat` or `alpindale@pygmalion.chat` and we'll take care of it for you. You may contact us for any other purpose as well, including yelling at us about when the next model will be released.

**⚠️ CAUTION: PIPPA contains conversations, themes and scenarios which can be considered "not safe for work" (NSFW) and/or heavily disturbing in nature. Models trained purely with PIPPA may have the tendency to generate X-rated output. You have been warned.**

## Dataset Summary

PIPPA consists of slightly more than 1 million lines of dialogue spread across 26,000 conversations between users of the popular chatbot website "Character.AI" and its large language model, obtained through a large community effort that took place over several months. Tallying shows that over 1,000 unique personas simulating both real and fictional characters are represented within the dataset, allowing PIPPA and LLMs fine-tuned on it to adapt to many different roleplay domains.

The dataset is stored as a JSONL file, with a single JSON snippet (one per line) representing one entire conversation.
Every snippet contains the following pieces of data:

- `submission_timestamp`: The Unix timestamp of when this particular conversation was submitted to the project, in milliseconds.
- `categories`: The categories assigned to the character on the Character.AI website, if any were assigned. If no categories were assigned, it will be `null`.
- `bot_id`: The unique ID assigned to the specific character which the user was conversing with on the website.
- `bot_name`: The name of the character.
- `bot_greeting`: The introductory line of the character to the user. This is always the first utterance of dialogue in a conversation.
- `bot_definitions`: Contains whatever was typed in the **Definitions** field in the character creator on the website. This usually consists of one or more example conversations between the user and the character, designed to steer the model towards emulating the persona correctly. Bot definitions required a separate effort to gather and thus may not be present for a specific persona; if this is the case, an empty string is provided. Because the definitions were written on Character.AI, this field usually follows Character.AI's unique formatting and should be preprocessed before being fed into any model. Please see **Appendix A** of the paper for further details.
- `bot_description`: Contains whatever was typed in the **Description** field in the character creator on the website. It usually consists of a few sentences which give a brief overview of the character and any important details about them.
- `conversation`: The conversation between the user and the model. This is represented as a list of dictionaries, each dictionary representing a single utterance and containing two key-value pairs: `message`, the utterance itself, and `is_human`, which is `true` when the utterance was written by the user and `false` when it was generated by the LLM.
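The field list above can be turned into a small reader. The following is a minimal sketch rather than an official loader: the sample record is invented for illustration, and the helper name `parse_record` and the shape of its return value are assumptions, not part of the dataset or any library.

```python
import json
from datetime import datetime, timezone

# Hypothetical record for illustration only: the keys follow the field list
# above, but every value here is made up.
sample_line = json.dumps({
    "submission_timestamp": 1690000000000,
    "categories": None,
    "bot_id": "example-bot-id",
    "bot_name": "Example",
    "bot_greeting": "Hello there!",
    "bot_definitions": "",
    "bot_description": "A placeholder character.",
    "conversation": [
        {"message": "Hello there!", "is_human": False},
        {"message": "Hi, who are you?", "is_human": True},
        {"message": "Just an example.", "is_human": False},
    ],
})


def parse_record(line: str) -> dict:
    """Parse one JSONL line into a lightly normalized dictionary."""
    record = json.loads(line)

    # The timestamp is in milliseconds; int() also tolerates a numeric string.
    seconds = int(record["submission_timestamp"]) / 1000
    submitted = datetime.fromtimestamp(seconds, tz=timezone.utc)

    # `categories` is null when none were assigned; normalize to a list.
    categories = record["categories"] or []

    # Label each utterance by speaker using the `is_human` flag.
    turns = [
        ("user" if turn["is_human"] else "bot", turn["message"])
        for turn in record["conversation"]
    ]

    return {
        "bot_name": record["bot_name"],
        "submitted": submitted,
        "categories": categories,
        # An empty string means no definitions were gathered for this persona.
        "has_definitions": bool(record["bot_definitions"]),
        "turns": turns,
    }


parsed = parse_record(sample_line)
for speaker, message in parsed["turns"]:
    print(f"{speaker}: {message}")
```

To read a whole file, the same function can be applied to each non-empty line of the JSONL file opened with UTF-8 encoding.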
For further information about PIPPA, please refer to our [published paper](https://arxiv.org/abs/2308.05884) or contact us at the emails listed above.

## Files

We publish PIPPA in multiple variants, each a single JSONL file:

- **pippa.jsonl**: The original dataset, almost exactly as submitted to us (barring any modifications resulting from the redaction of personally identifiable information).
- **pippa_deduped.jsonl**: The 'cleaned' version of PIPPA, with duplicate conversations as well as any conversation with fewer than three turns removed from the dataset. **We recommend using this file.**
- **pippa_metharme.jsonl**: A version of deduped PIPPA which is formatted in a similar way to our [Metharme instructional models](https://huggingface.co/PygmalionAI/metharme-13b), useful as an example to demonstrate how to properly format the PIPPA dataset.

If you are using HuggingFace's `datasets` library, you can choose the file you wish to use by specifying its name (without extension) as an argument, like so: `dataset = load_dataset("PygmalionAI/PIPPA", 'pippa_deduped')`. The default value is `pippa_deduped`.

Thank you for your patience, everyone!

## Citation

If you're using our dataset, please consider citing our work:

```bibtex
@misc{gosling2023pippa,
  title={PIPPA: A Partially Synthetic Conversational Dataset},
  author={Tear Gosling and Alpin Dale and Yinhe Zheng},
  year={2023},
  eprint={2308.05884},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

___

Any relationship between the name of this dataset and any public personas is entirely and totally coincidental.
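As a rough illustration of the cleaning described for `pippa_deduped.jsonl` in the Files section, the sketch below drops short and duplicate conversations. It is not the procedure the authors used: the function name `clean_conversations`, the choice to count each entry of `conversation` as one turn, and the duplicate key (the `bot_id` plus the exact sequence of messages) are all assumptions made for the example.

```python
def clean_conversations(records, min_turns=3):
    """Keep the first copy of each conversation with at least `min_turns` entries.

    Assumptions (not taken from the dataset card): a "turn" is one entry in the
    `conversation` list, and two records are duplicates when they share the
    same `bot_id` and the same ordered (is_human, message) pairs.
    """
    seen = set()
    kept = []
    for record in records:
        messages = tuple(
            (turn["is_human"], turn["message"])
            for turn in record["conversation"]
        )
        if len(messages) < min_turns:
            continue  # too short to keep
        key = (record["bot_id"], messages)
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        kept.append(record)
    return kept


# Hypothetical records for illustration only.
long_chat = {
    "bot_id": "a",
    "conversation": [
        {"message": "Hi!", "is_human": False},
        {"message": "Hello.", "is_human": True},
        {"message": "How are you?", "is_human": False},
    ],
}
short_chat = {
    "bot_id": "b",
    "conversation": [{"message": "Hi!", "is_human": False}],
}
records = [long_chat, dict(long_chat), short_chat]

cleaned = clean_conversations(records)
print(len(cleaned))
```

Keying on exact message sequences only catches verbatim repeats; near-duplicates would need a fuzzier comparison, which is outside the scope of this sketch.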

Provider:
maas
Created:
2025-11-18