five

WildChat-4.8M

收藏
魔搭社区2025-11-27 更新2025-08-23 收录
下载链接:
https://modelscope.cn/datasets/allenai/WildChat-4.8M
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for WildChat-4.8M ## Dataset Description - **Interactive Search Tool:** https://wildvisualizer.com - **WildChat paper:** https://arxiv.org/abs/2405.01470 - **WildVis paper:** https://arxiv.org/abs/2409.03753 - **Point of Contact:** [Yuntian Deng](https://yuntiandeng.com/) ## Dataset Description - **Interactive Search Tool:** https://wildvisualizer.com - **WildChat paper:** https://arxiv.org/abs/2405.01470 - **WildVis paper:** https://arxiv.org/abs/2409.03753 - **Point of Contact:** [Yuntian Deng](https://yuntiandeng.com/) ### Dataset Summary WildChat-4.8M is a collection of **3,199,860 conversations** between human users and ChatGPT. This version **only contains non-toxic user inputs and ChatGPT responses**, as flagged by the OpenAI Moderations API or Detoxify. It is derived from the [WildChat-4.8M-Full](https://huggingface.co/datasets/allenai/WildChat-4.8M-Full) dataset (4,743,336 conversations after minors removal from the original **4,804,190** conversations) by filtering out 1,543,476 toxic conversations. The dataset includes state, country, hashed IP addresses, request headers, and full conversation transcripts. The dataset contains a broad spectrum of user-chatbot interactions: ambiguous requests, code-switching, topic shifts, political debates, and more. It also contains **111,836** non-toxic conversations from **reasoning models** `o1-preview` and `o1-mini`. This version includes only **non-toxic conversations** as flagged by the OpenAI Moderations API or Detoxify. For most use cases that do not require toxic data, this dataset is recommended. If you need access to a version that contains both toxic and non-toxic conversations, please refer to the gated [WildChat-4.8M-Full](https://huggingface.co/datasets/allenai/WildChat-4.8M-Full). ### Updates **2025-08-11: Content Update** - Extended coverage to data up to (but excluding) August 1, 2025. - Released the [data processing script](https://github.com/da03/wildchat) used to construct this dataset. - Added TruffleHog scanning to remove verified secrets from the conversations. - Highlight: **111,836 reasoning model conversations** from `o1-preview` and `o1-mini`. ### Full Version with Toxic Content For access to the full version of the WildChat dataset which includes toxic conversations, please refer to [WildChat-4.8M-Full](https://huggingface.co/datasets/allenai/WildChat-4.8M-Full). That version is gated and requires manual approval with a detailed justification for why toxic data is needed. ### Statistics | Model Family | Count | |----------------|-----------| | gpt-4o | 1,539,780 | | gpt-3.5-turbo | 688,900 | | gpt-4.1-mini | 634,037 | | gpt-4 | 202,915 | | o1-mini | 58,529 | | o1-preview | 53,307 | | gpt-4-turbo | 22,392 | | **Total** | 3,199,860 | ### Data Fields - `conversation_hash` (string): The hash of each conversation's content. This is not a unique key, as different conversations with the same content will share the same hash. For unique identifiers, use `turn_identifier` within each turn. - `model` (string): The underlying OpenAI model, such as gpt-3.5-turbo or gpt-4. - `timestamp` (timestamp): The timestamp of the last turn in the conversation in UTC. - `conversation` (list): A list of user/assistant utterances. Each utterance is a dictionary containing the `role` of the speaker (user or assistant), the `content` of the utterance, the detected `language` of the utterance, whether the content of the utterance is considered `toxic`, and whether PII has been detected and anonymized (`redacted`). For user turns, there's also the hashed IP address `hashed_ip` of the turn, the state `state` and country `country` inferred from the original IP address, and the request headers `header` (which might be useful for linking multiple conversations from the same user when used in conjunction with `hashed_ip`). For assistant turns, there's a field `timestamp` which is the time when the backend server receives the full response from ChatGPT. For both user and assistant turns, there's a unique identifier `turn_identifier`. - `turn` (int): The number of turns in the conversation. A turn refers to one round of user-assistant interaction. - `language` (string): The language of the conversation. Note that this is the most frequently detected language in the utterances of the conversation. - `openai_moderation` (list): A list of OpenAI Moderation results. Each element in the list corresponds to one utterance in the conversation. When the content of an utterance is an empty string, the corresponding moderation reult is set to be an empty dictionary. - `detoxify_moderation` (list): A list of Detoxify results. Each element in the list corresponds to one utterance in the conversation. When the content of an utterance is an empty string, the corresponding Detoxify reult is set to be an empty dictionary. - `toxic` (bool): Whether this conversation contains any utterances considered to be toxic by either OpenAI Moderation or Detoxify. - `redacted` (bool): Whether this conversation contains any utterances in which PII or API secrets are detected and anonymized. - `state` (string): The state inferred from the most common IP address in the conversation. Its value is sometimes `None` when GeoIP2 does not identify the state of an IP address. - `country` (string): The country inferred from the most common IP address in the conversation. Its value is sometimes `None` when GeoIP2 does not identify the country of an IP address. - `hashed_ip` (string): The most common hashed IP address in the conversation. - `header` (string): The request header containing information about operating system, browser versions, and accepted languages. This field might be useful for linking multiple conversations from the same user when used in conjunction with `hashed_ip`. Note that every turn in a conversation has the same header, as this is the way we linked turns into conversations. ### Languages Covers dozens of languages (68 detected in earlier releases). ### Personal and Sensitive Information The dataset has been de-identified with Microsoft Presidio, custom regex rules, and manual adjustments. Verified secrets were removed using TruffleHog scanning. ### Reserved Data for Evaluation A small subset of conversations from WildChat was reserved for building [WildBench](https://arxiv.org/abs/2406.04770), a benchmark for evaluating large language models on real-world user queries. ### Empty User Inputs This dataset includes a small subset of conversations where users submitted empty inputs, sometimes leading to hallucinated responses from the assistant. This behavior, first noticed by @yuchenlin, arises from the design of our Hugging Face chatbot used for data collection, which did not restrict the submission of empty inputs. As a result, users could submit without entering any text, causing the assistant to generate responses without any user prompts. This observation motivated our work [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464), which uses empty or template-only prompts to elicit self-generated queries from aligned LLMs for large-scale instruction data synthesis. ### Data Removal Requests If you believe your own data is included in WildChat and you would like it removed, or if you encounter content that is illegal, you may request deletion. To do so, please contact me using the information on my homepage: [https://yuntiandeng.com](https://yuntiandeng.com). Please include: - **Conversation hash(es)** and/or **turn identifier(s)** corresponding to the entries you wish to remove. - A brief explanation of the reason for removal. - Any additional information that could help verify authorship or confirm the issue. ### Citation Information Please consider citing the following papers if you find this dataset useful: ``` @inproceedings{ zhao2024wildchat, title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild}, author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng}, booktitle={The Twelfth International Conference on Learning Representations}, year={2024}, url={https://openreview.net/forum?id=Bl8u7ZRlbM} } ``` ``` @inproceedings{deng2024wildvis, title = "{W}ild{V}is: Open Source Visualizer for Million-Scale Chat Logs in the Wild", author = "Deng, Yuntian and Zhao, Wenting and Hessel, Jack and Ren, Xiang and Cardie, Claire and Choi, Yejin", booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", year = "2024", url = "https://aclanthology.org/2024.emnlp-demo.50/" } ```

# WildChat-4.8M 数据集卡片 ## 数据集基本信息 - **交互式搜索工具:** https://wildvisualizer.com - **WildChat 论文:** https://arxiv.org/abs/2405.01470 - **WildVis 论文:** https://arxiv.org/abs/2409.03753 - **联系人:** [Yuntian Deng](https://yuntiandeng.com/) ## 数据集详细描述 - **交互式搜索工具:** https://wildvisualizer.com - **WildChat 论文:** https://arxiv.org/abs/2405.01470 - **WildVis 论文:** https://arxiv.org/abs/2409.03753 - **联系人:** [Yuntian Deng](https://yuntiandeng.com/) ### 数据集摘要 WildChat-4.8M 是一个包含**3,199,860 条**人类用户与 ChatGPT 的对话数据集。本版本仅包含经 OpenAI 内容审核 API(OpenAI Moderations API)或 Detoxify 工具标记为无不当内容的用户输入与 ChatGPT 回复。该数据集源自 [WildChat-4.8M-Full](https://huggingface.co/datasets/allenai/WildChat-4.8M-Full) 数据集(原始数据集共 4,804,190 条对话,移除未成年人相关内容后剩余 4,743,336 条对话),通过过滤掉 1,543,476 条含不当内容的对话得到。数据集包含对话所属州、国家、哈希化 IP 地址、请求头以及完整的对话转录文本。 该数据集涵盖了多样化的用户与聊天机器人交互场景,包括模糊请求、语码转换、话题转移、政治辩论等。此外,还包含**111,836 条**来自推理模型 `o1-preview` 与 `o1-mini` 的无不当内容对话。 本版本仅包含经 OpenAI 内容审核 API 或 Detoxify 工具标记为无不当内容的对话。对于大多数无需含不当内容数据的应用场景,推荐使用本数据集。若需要同时包含不当与无不当内容对话的版本,请参考受权限控制的 [WildChat-4.8M-Full](https://huggingface.co/datasets/allenai/WildChat-4.8M-Full)。 ### 更新日志 **2025-08-11:内容更新** - 数据覆盖范围扩展至 2025 年 8 月 1 日(不含当日)的数据。 - 发布了用于构建本数据集的[数据处理脚本](https://github.com/da03/wildchat)。 - 新增 TruffleHog 敏感信息扫描步骤,用于移除对话中已验证的敏感密钥。 - 亮点内容:新增来自 `o1-preview` 与 `o1-mini` 的**111,836 条推理模型对话**。 ### 含不当内容的完整版本 如需获取包含不当内容的 WildChat 数据集完整版本,请参考 [WildChat-4.8M-Full](https://huggingface.co/datasets/allenai/WildChat-4.8M-Full)。该版本受权限控制,需提交包含详细理由的申请以获取访问权限,说明需要使用不当内容数据的原因。 ### 统计数据 | 模型家族 | 对话数量 | |----------------|-----------| | gpt-4o | 1,539,780 | | gpt-3.5-turbo | 688,900 | | gpt-4.1-mini | 634,037 | | gpt-4 | 202,915 | | o1-mini | 58,529 | | o1-preview | 53,307 | | gpt-4-turbo | 22,392 | | **总计** | 3,199,860 | ### 数据字段 - `conversation_hash`(字符串):每条对话内容的哈希值。由于内容相同的不同对话会共享同一哈希值,因此该字段并非唯一标识符。如需唯一标识,请使用每一轮对话中的 `turn_identifier` 字段。 - `model`(字符串):所使用的 OpenAI 底层模型,例如 gpt-3.5-turbo 或 gpt-4。 - `timestamp`(时间戳):对话最后一轮交互的 UTC 时间戳。 - `conversation`(列表):用户/助手发言的列表。每条发言为一个字典,包含发言者的 `role`(用户或助手)、发言 `content`、检测到的发言 `language`、发言内容是否被认定为 `toxic`,以及是否已检测并匿名化了个人可识别信息(Personally Identifiable Information, PII)(`redacted` 字段)。对于用户轮次的发言,还包含该轮次的哈希化 IP 地址 `hashed_ip`、从原始 IP 地址推断出的州 `state` 与国家 `country`,以及请求头 `header`(结合 `hashed_ip` 字段可用于关联同一用户的多条对话)。对于助手轮次的发言,包含字段 `timestamp`,即后端服务器收到 ChatGPT 完整回复的时间。无论是用户还是助手的发言,均包含唯一标识符 `turn_identifier`。 - `turn`(整数):对话的轮次数量。一轮交互指一次用户-助手的完整交互循环。 - `language`(字符串):对话的语言。此处指对话中发言检测到的最频繁使用的语言。 - `openai_moderation`(列表):OpenAI 内容审核结果的列表。列表中的每个元素对应对话中的一条发言。若发言内容为空字符串,则对应的审核结果为空字典。 - `detoxify_moderation`(列表):Detoxify 内容审核结果的列表。列表中的每个元素对应对话中的一条发言。若发言内容为空字符串,则对应的审核结果为空字典。 - `toxic`(布尔值):该对话是否包含任意一条被 OpenAI 内容审核或 Detoxify 工具认定为不当的发言。 - `redacted`(布尔值):该对话是否包含任意一条已检测并匿名化了 PII 或 API 密钥的发言。 - `state`(字符串):从对话中最常见的 IP 地址推断出的州。当 GeoIP2 无法识别 IP 地址对应的州时,该字段值为 `None`。 - `country`(字符串):从对话中最常见的 IP 地址推断出的国家。当 GeoIP2 无法识别 IP 地址对应的国家时,该字段值为 `None`。 - `hashed_ip`(字符串):对话中最常见的哈希化 IP 地址。 - `header`(字符串):请求头,包含操作系统、浏览器版本与接受语言等信息。结合 `hashed_ip` 字段可用于关联同一用户的多条对话。需注意,对话中的每一轮发言均拥有相同的请求头,这是我们将各轮发言整合为对话的依据。 ### 支持语言 覆盖数十种语言(早期版本中检测到 68 种语言)。 ### 个人与敏感信息处理 本数据集已通过 Microsoft Presidio、自定义正则表达式规则与人工调整完成去标识化处理。已通过 TruffleHog 扫描移除已验证的敏感密钥。 ### 预留评估数据集 WildChat 数据集的一小部分对话被预留用于构建 [WildBench](https://arxiv.org/abs/2406.04770),这是一个用于评估大语言模型(Large Language Model, LLM)在真实世界用户查询上表现的基准数据集。 ### 空用户输入情况 本数据集包含一小部分用户提交空输入的对话,有时会导致助手生成幻觉回复。该现象最早由 @yuchenlin 发现,源于我们用于数据收集的 Hugging Face 聊天机器人的设计缺陷:该机器人未限制空输入提交,因此用户可在未输入任何文本的情况下提交请求,导致助手在无用户提示的情况下生成回复。这一观察结果推动了我们的研究工作 [Magpie: 通过向对齐大语言模型提交空提示来从头生成对齐数据](https://arxiv.org/abs/2406.08464),该工作利用空提示或仅含模板的提示来从对齐大语言模型中诱导出自生成的查询,用于大规模指令数据合成。 ### 数据删除请求 若您认为自己的数据被包含在 WildChat 数据集中并希望将其移除,或遇到了违法内容,您可提出删除请求。 请通过我的个人主页 [https://yuntiandeng.com](https://yuntiandeng.com) 提供的联系方式联系我。请包含以下信息: - 您希望删除的条目对应的**对话哈希值**和/或**对话轮次标识符**。 - 简要说明删除请求的理由。 - 任何可帮助验证作者身份或确认问题的额外信息。 ### 引用信息 若您认为本数据集对您的研究有帮助,请引用以下论文: @inproceedings{ zhao2024wildchat, title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild}, author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng}, booktitle={The Twelfth International Conference on Learning Representations}, year={2024}, url={https://openreview.net/forum?id=Bl8u7ZRlbM} } @inproceedings{deng2024wildvis, title = "{W}ild{V}is: Open Source Visualizer for Million-Scale Chat Logs in the Wild", author = "Deng, Yuntian and Zhao, Wenting and Hessel, Jack and Ren, Xiang and Cardie, Claire and Choi, Yejin", booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", year = "2024", url = "https://aclanthology.org/2024.emnlp-demo.50/" }
提供机构:
maas
创建时间:
2025-08-12
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作