SkyKam/WildChat-1M
Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/SkyKam/WildChat-1M
Resource description:
---
license: odc-by
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- question-answering
- text2text-generation
pretty_name: WildChat-1M
dataset_info:
  features:
  - name: conversation_hash
    dtype: string
  - name: model
    dtype: string
  - name: timestamp
    dtype: timestamp[us, tz=UTC]
  - name: conversation
    list:
    - name: content
      dtype: string
    - name: country
      dtype: string
    - name: hashed_ip
      dtype: string
    - name: header
      struct:
      - name: accept-language
        dtype: string
      - name: user-agent
        dtype: string
    - name: language
      dtype: string
    - name: redacted
      dtype: bool
    - name: role
      dtype: string
    - name: state
      dtype: string
    - name: timestamp
      dtype: timestamp[us, tz=UTC]
    - name: toxic
      dtype: bool
    - name: turn_identifier
      dtype: int64
  - name: turn
    dtype: int64
  - name: language
    dtype: string
  - name: openai_moderation
    list:
    - name: categories
      struct:
      - name: harassment
        dtype: bool
      - name: harassment/threatening
        dtype: bool
      - name: harassment_threatening
        dtype: bool
      - name: hate
        dtype: bool
      - name: hate/threatening
        dtype: bool
      - name: hate_threatening
        dtype: bool
      - name: self-harm
        dtype: bool
      - name: self-harm/instructions
        dtype: bool
      - name: self-harm/intent
        dtype: bool
      - name: self_harm
        dtype: bool
      - name: self_harm_instructions
        dtype: bool
      - name: self_harm_intent
        dtype: bool
      - name: sexual
        dtype: bool
      - name: sexual/minors
        dtype: bool
      - name: sexual_minors
        dtype: bool
      - name: violence
        dtype: bool
      - name: violence/graphic
        dtype: bool
      - name: violence_graphic
        dtype: bool
    - name: category_scores
      struct:
      - name: harassment
        dtype: float64
      - name: harassment/threatening
        dtype: float64
      - name: harassment_threatening
        dtype: float64
      - name: hate
        dtype: float64
      - name: hate/threatening
        dtype: float64
      - name: hate_threatening
        dtype: float64
      - name: self-harm
        dtype: float64
      - name: self-harm/instructions
        dtype: float64
      - name: self-harm/intent
        dtype: float64
      - name: self_harm
        dtype: float64
      - name: self_harm_instructions
        dtype: float64
      - name: self_harm_intent
        dtype: float64
      - name: sexual
        dtype: float64
      - name: sexual/minors
        dtype: float64
      - name: sexual_minors
        dtype: float64
      - name: violence
        dtype: float64
      - name: violence/graphic
        dtype: float64
      - name: violence_graphic
        dtype: float64
    - name: flagged
      dtype: bool
  - name: detoxify_moderation
    list:
    - name: identity_attack
      dtype: float64
    - name: insult
      dtype: float64
    - name: obscene
      dtype: float64
    - name: severe_toxicity
      dtype: float64
    - name: sexual_explicit
      dtype: float64
    - name: threat
      dtype: float64
    - name: toxicity
      dtype: float64
  - name: toxic
    dtype: bool
  - name: redacted
    dtype: bool
  - name: state
    dtype: string
  - name: country
    dtype: string
  - name: hashed_ip
    dtype: string
  - name: header
    struct:
    - name: accept-language
      dtype: string
    - name: user-agent
      dtype: string
  splits:
  - name: train
    num_bytes: 6844366367.030628
    num_examples: 837989
  download_size: 3360836020
  dataset_size: 6844366367.030628
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- instruction-finetuning
---
# Dataset Card for WildChat
## Dataset Description
- **Paper:** https://arxiv.org/abs/2405.01470
- **Interactive Search Tool:** https://wildvisualizer.com ([paper](https://arxiv.org/abs/2409.03753))
- **License:** [ODC-BY](https://opendatacommons.org/licenses/by/1-0/)
- **Language(s) (NLP):** multi-lingual
- **Point of Contact:** [Yuntian Deng](https://yuntiandeng.com/)
### Dataset Summary
WildChat is a collection of 1 million conversations between human users and ChatGPT, alongside demographic data, including state, country, hashed IP addresses, and request headers. We collected WildChat by offering online users free access to OpenAI's GPT-3.5 and GPT-4. In this version, 25.53% of the conversations come from the GPT-4 chatbot, while the rest come from the GPT-3.5 chatbot. The dataset contains a broad spectrum of user-chatbot interactions that are not previously covered by other instruction fine-tuning datasets: for example, interactions include ambiguous user requests, code-switching, topic-switching, political discussions, etc. WildChat can serve both as a dataset for instructional fine-tuning and as a valuable resource for studying user behaviors. Note that this version of the dataset only contains non-toxic user inputs/ChatGPT responses.
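As a rough illustration of how the model split mentioned above could be checked, the following sketch streams the dataset and tallies conversations by model family. It assumes the Hugging Face `datasets` package is installed; the repository id `SkyKam/WildChat-1M` and the helper names `stream_wildchat` and `model_family_share` are assumptions made for this example, not part of the dataset's tooling.

```python
from collections import Counter
from typing import Iterable, Iterator


def stream_wildchat(repo_id: str = "SkyKam/WildChat-1M") -> Iterator[dict]:
    """Yield conversations one at a time instead of downloading everything.

    `repo_id` is an assumption; substitute whichever copy of the dataset you use.
    """
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import load_dataset

    yield from load_dataset(repo_id, split="train", streaming=True)


def model_family_share(records: Iterable[dict]) -> dict[str, float]:
    """Return the fraction of conversations per model family.

    Any `model` value that does not start with "gpt-4" is counted as GPT-3.5,
    which matches the two chatbots described in the summary.
    """
    counts = Counter(
        "gpt-4" if rec["model"].startswith("gpt-4") else "gpt-3.5"
        for rec in records
    )
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()} if total else {}
```

A possible use is `model_family_share(itertools.islice(stream_wildchat(), 10_000))`, which estimates the split from the first 10,000 streamed conversations; a sample of that size will not necessarily reproduce the 25.53% figure exactly.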
### Updates
**2024-10-17: Content Update.** Conversations flagged by [Niloofar Mireshghallah](https://homes.cs.washington.edu/~niloofar/) and her collaborators in ["Breaking News: Case Studies of Generative AI's Use in Journalism"](https://arxiv.org/abs/2406.13706) for containing PII or sensitive information have been removed from this version of the dataset.
**2024-07-22: Content Update.** All toxic conversations identified by the OpenAI Moderations API or Detoxify have been removed from this version of the dataset.
**2024-06-26: License Change.** We have updated the license of WildChat to [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). This change is retroactively applied to any previous downloads under the ImpACT license.
### Full Version with Toxic Content
For access to the full version of the WildChat dataset, which includes toxic conversations flagged by the OpenAI Moderations API or Detoxify, please refer to [WildChat-1M-Full](https://huggingface.co/datasets/allenai/WildChat-1M-Full). This version requires approval and justification for why toxic data is needed.
### Languages
68 languages were detected in WildChat.
### Personal and Sensitive Information
The data has been de-identified with Microsoft Presidio and hand-written rules by the authors.
### Data Fields
- `conversation_hash` (string): The hash of each conversation's content. This is not a unique key, as different conversations with the same content will share the same hash. For unique identifiers, use `turn_identifier` within each turn.
- `model` (string): The underlying OpenAI model, such as gpt-3.5-turbo or gpt-4.
- `timestamp` (timestamp): The timestamp of the last turn in the conversation in UTC.
- `conversation` (list): A list of user/assistant utterances. Each utterance is a dictionary containing the `role` of the speaker (user or assistant), the `content` of the utterance, the detected `language` of the utterance, whether the content of the utterance is considered `toxic`, and whether PII has been detected and anonymized (`redacted`). For user turns, there's also the hashed IP address `hashed_ip` of the turn, the state `state` and country `country` inferred from the original IP address, and the request headers `header` (which might be useful for linking multiple conversations from the same user when used in conjunction with `hashed_ip`). For assistant turns, there's a field `timestamp` which is the time when the backend server receives the full response from ChatGPT. For both user and assistant turns, there's a unique identifier `turn_identifier`.
- `turn` (int): The number of turns in the conversation. A turn refers to one round of user-assistant interaction.
- `language` (string): The language of the conversation. Note that this is the most frequently detected language in the utterances of the conversation.
- `openai_moderation` (list): A list of OpenAI Moderation results. Each element in the list corresponds to one utterance in the conversation. When the content of an utterance is an empty string, the corresponding moderation result is set to an empty dictionary.
- `detoxify_moderation` (list): A list of Detoxify results. Each element in the list corresponds to one utterance in the conversation. When the content of an utterance is an empty string, the corresponding Detoxify result is set to an empty dictionary.
- `toxic` (bool): Whether this conversation contains any utterances considered to be toxic by either OpenAI Moderation or Detoxify.
- `redacted` (bool): Whether this conversation contains any utterances in which PII is detected and anonymized.
- `state` (string): The state inferred from the most common IP address in the conversation. Its value is sometimes `None` when GeoIP2 does not identify the state of an IP address.
- `country` (string): The country inferred from the most common IP address in the conversation. Its value is sometimes `None` when GeoIP2 does not identify the country of an IP address.
- `hashed_ip` (string): The most common hashed IP address in the conversation.
- `header` (struct): The request header, with `accept-language` and `user-agent` entries that carry information about the operating system, browser version, and accepted languages. This field might be useful for linking multiple conversations from the same user when used in conjunction with `hashed_ip`. Note that every turn in a conversation has the same header, as this is the way we linked turns into conversations.
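The field descriptions above imply two common access patterns: walking the utterances alongside their index-aligned moderation results, and grouping conversations that probably come from the same user. The sketch below shows one way to do both on records already loaded as plain dictionaries. The helper names (`utterances_with_moderation`, `user_key`, `group_by_user`) are hypothetical, and treating `hashed_ip` plus both header values as a user key is a heuristic that follows the card's suggestion, not a guaranteed identity.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def utterances_with_moderation(record: dict) -> Iterator[tuple[dict, dict, dict]]:
    """Pair each utterance with its OpenAI Moderation and Detoxify results.

    The three lists are described as index-aligned, one entry per utterance.
    """
    yield from zip(
        record["conversation"],
        record["openai_moderation"],
        record["detoxify_moderation"],
    )


def user_key(record: dict) -> tuple:
    """Build a heuristic key from the conversation-level hashed IP and header."""
    header = record.get("header") or {}
    return (
        record.get("hashed_ip"),
        header.get("user-agent"),
        header.get("accept-language"),
    )


def group_by_user(records: Iterable[dict]) -> dict[tuple, list[str]]:
    """Map each heuristic user key to the hashes of its conversations."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for rec in records:
        groups[user_key(rec)].append(rec["conversation_hash"])
    return dict(groups)
```

Because `conversation_hash` is not unique, the lists returned by `group_by_user` may contain repeated hashes; collecting `turn_identifier` values instead would avoid that.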
### Empty User Inputs
This dataset includes a small subset of conversations where users submitted empty inputs, sometimes leading to hallucinated responses from the assistant. This issue, first noticed by @yuchenlin, arises from the design of our Huggingface chatbot used for data collection, which did not restrict the submission of empty inputs. As a result, users could submit without entering any text, causing the assistant to generate responses without any user prompts. This occurs in a small fraction of the dataset.
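For fine-tuning it may be desirable to drop these conversations. A minimal sketch of such a filter is shown here; the helper names are hypothetical, and counting whitespace-only user inputs as empty is an assumption of this example rather than something the dataset card specifies.

```python
from typing import Iterable


def has_empty_user_input(record: dict) -> bool:
    """Return True if any user utterance has empty (or whitespace-only) content."""
    return any(
        utt["role"] == "user" and not (utt.get("content") or "").strip()
        for utt in record["conversation"]
    )


def drop_empty_user_inputs(records: Iterable[dict]) -> list[dict]:
    """Keep only conversations in which every user utterance has some text."""
    return [rec for rec in records if not has_empty_user_input(rec)]
```

An empty assistant utterance does not trigger the filter, since the issue described above concerns user inputs only.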
### Licensing Information
WildChat is now made available under the [**ODC-BY License**](https://opendatacommons.org/licenses/by/1-0/). This change is retroactively applied to any previous downloads under the ImpACT license.
### Citation Information
Please consider citing [our paper](https://arxiv.org/abs/2405.01470) if you find this dataset useful:
```
@inproceedings{
zhao2024wildchat,
title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild},
author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=Bl8u7ZRlbM}
}
```
```
@misc{deng2024wildvisopensourcevisualizer,
title={WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild},
author={Yuntian Deng and Wenting Zhao and Jack Hessel and Xiang Ren and Claire Cardie and Yejin Choi},
year={2024},
eprint={2409.03753},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.03753},
}
```
Provider:
SkyKam



