WildChat
收藏魔搭社区2026-05-03 更新2024-06-22 收录
下载链接:
https://modelscope.cn/datasets/thomas/WildChat
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for WildChat
## Note: a newer version with 1 million conversations and demographic information can be found [here](https://huggingface.co/datasets/allenai/WildChat-1M).
## Dataset Description
- **Paper:** https://openreview.net/pdf?id=Bl8u7ZRlbM
- **License:** https://allenai.org/licenses/impact-lr
- **Language(s) (NLP):** multi-lingual
- **Point of Contact:** [Yuntian Deng](mailto:yuntiand@allenai.org)
### Dataset Summary
WildChat is a collection of 650K conversations between human users and ChatGPT. We collected WildChat by offering online users free access to OpenAI's GPT-3.5 and GPT-4. The dataset contains a broad spectrum of user-chatbot interactions that are not previously covered by other instruction fine-tuning datasets: for example, interactions include ambiguous user requests, code-switching, topic-switching, political discussions, etc. WildChat can serve both as a dataset for instructional fine-tuning and as a valuable resource for studying user behaviors. Note that this dataset contains toxic user inputs/ChatGPT responses. A nontoxic subset of this dataest can be found [here](https://huggingface.co/datasets/allenai/WildChat-nontoxic).
WildChat has been openly released under AI2's ImpACT license as a low-risk artifact. The use of WildChat to cause harm is strictly prohibited.
### Languages
66 languages were detected in WildChat.
### Personal and Sensitive Information
The data has been de-identified with Microsoft Presidio and hand-written rules by the authors.
### Data Fields
- `conversation_id` (string): Each conversation has a unique id.
- `model` (string): The underlying OpenAI model, such as gpt-3.5-turbo or gpt-4.
- `timestamp` (timestamp): The timestamp of the last turn in the conversation in UTC.
- `conversation` (list): A list of user/assistant utterances. Each utterance is a dictionary containing the `role` of the speaker (user or assistant), the `content` of the utterance, the detected `language` of the utterance, whether the content of the utterance is considered `toxic`, and whether PII has been detected and anonymized (`redacted`).
- `turn` (int): The number of turns in the conversation. A turn refers to one round of user-assistant interaction.
- `language` (string): The language of the conversation. Note that this is the most frequently detected language in the utterances of the conversation.
- `openai_moderation` (list): A list of OpenAI Moderation results. Each element in the list corresponds to one utterance in the conversation.
- `detoxify_moderation` (list): A list of Detoxify results. Each element in the list corresponds to one utterance in the conversation.
- `toxic` (bool): Whether this conversation contains any utterances considered to be toxic by either OpenAI Moderation or Detoxify.
- `redacted` (bool): Whether this conversation contains any utterances in which PII is detected and anonymized.
### Empty User Inputs
This dataset includes a small subset of conversations where users submitted empty inputs, sometimes leading to hallucinated responses from the assistant. This issue, first noticed by @yuchenlin, arises from the design of our Huggingface chatbot used for data collection, which did not restrict the submission of empty inputs. As a result, users could submit without entering any text, causing the assistant to generate responses without any user prompts. This occurs in a small fraction of the dataset---12,405 out of 652,139 conversations.
### Licensing Information
WildChat is made available under the [**AI2
ImpACT License - Low Risk Artifacts ("LR
Agreement")**](https://allenai.org/licenses/impact-lr)
### Citation Information
Please consider citing [our paper](https://arxiv.org/abs/2405.01470) if you find this dataset useful:
```
@inproceedings{
zhao2024wildchat,
title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild},
author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=Bl8u7ZRlbM}
}
```
# WildChat 数据集卡片
## 注意:包含100万条对话及人口统计信息的更新版本可在此处获取:https://huggingface.co/datasets/allenai/WildChat-1M
## 数据集描述
- **论文链接:** https://openreview.net/pdf?id=Bl8u7ZRlbM
- **许可协议:** https://allenai.org/licenses/impact-lr
- **自然语言处理所用语言:** 多语言
- **联络人:** [邓云天(Yuntian Deng)](mailto:yuntiand@allenai.org)
### 数据集摘要
WildChat是人类用户与ChatGPT(聊天生成预训练转换器)之间65万条对话的集合。我们通过向在线用户免费开放OpenAI的GPT-3.5(生成式预训练Transformer 3.5)和GPT-4(生成式预训练Transformer 4)模型,收集得到了本数据集。本数据集收录了此前其他指令微调数据集未覆盖的多样化用户-聊天机器人交互场景,例如模糊用户请求、语码转换、主题切换、政治讨论等。WildChat既可作为指令微调数据集使用,也可作为研究用户行为的宝贵资源。请注意,本数据集包含有毒的用户输入/ChatGPT回复。本数据集的无毒子集可在此处获取:https://huggingface.co/datasets/allenai/WildChat-nontoxic。
WildChat已以低风险制品(low-risk artifact)的形式通过AI2的ImpACT许可协议开源发布,严格禁止使用WildChat造成危害。
### 语言分布
WildChat中共检测到66种语言。
### 个人与敏感信息
作者已通过微软Presidio工具及人工编写的规则对数据进行了去标识化处理。
### 数据字段
- `conversation_id`(字符串类型):每条对话均配有唯一标识符。
- `model`(字符串类型):所使用的OpenAI底层模型,例如gpt-3.5-turbo或gpt-4。
- `timestamp`(时间戳类型):对话最后一轮交互的UTC时间戳。
- `conversation`(列表类型):由用户/助手发言组成的列表。每条发言为一个字典,包含发言者角色(`role`,用户或助手)、发言内容(`content`)、检测到的发言语言(`language`)、发言内容是否被判定为有毒(`toxic`),以及是否检测到个人可识别信息(PII,Personal Identifiable Information)并已进行匿名化处理(`redacted`)。
- `turn`(整数类型):对话的轮次数量,一轮即代表一次用户-助手交互。
- `language`(字符串类型):对话的语言。请注意,该值为对话中发言检测到的最频繁使用的语言。
- `openai_moderation`(列表类型):OpenAI内容审核(OpenAI Moderation)结果列表,列表中每个元素对应对话中的一条发言。
- `detoxify_moderation`(列表类型):Detoxify审核结果列表,列表中每个元素对应对话中的一条发言。
- `toxic`(布尔类型):标记该对话是否存在被OpenAI内容审核或Detoxify判定为有毒的发言。
- `redacted`(布尔类型):标记该对话是否存在已检测到个人可识别信息并完成匿名化处理的发言。
### 空用户输入
本数据集包含少量用户提交空输入的对话,此类情况有时会导致助手生成幻觉回复。该问题由@yuchenlin首先发现,源于我们用于数据收集的Huggingface聊天机器人未对空输入提交进行限制,导致用户可在未输入任何文本的情况下提交请求,进而使助手在无用户提示的情况下生成回复。此类情况仅占数据集的极小部分——652139条对话中仅12405条存在该问题。
### 许可协议信息
WildChat基于[**AI2 ImpACT低风险制品许可协议("LR协议")**](https://allenai.org/licenses/impact-lr)发布。
### 引用信息
若您认为本数据集对您的研究有所帮助,请引用[我们的论文](https://arxiv.org/abs/2405.01470):
@inproceedings{
zhao2024wildchat,
title={WildChat: 1M ChatGPT Interaction Logs in the Wild},
author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=Bl8u7ZRlbM}
}
提供机构:
maas
创建时间:
2024-05-13
搜集汇总
数据集介绍

背景与挑战
背景概述
WildChat是一个包含65万条人类与ChatGPT交互记录的多语言数据集,涵盖多样化对话类型(如模糊请求、代码转换等),并标记了毒性内容,适用于指令微调和用户行为研究。数据经过去标识化处理,遵循AI2 ImpACT低风险许可协议。
以上内容由遇见数据集搜集并总结生成



