five

WildChat-1M

收藏
魔搭社区2026-04-28 更新2025-05-31 收录
下载链接:
https://modelscope.cn/datasets/allenai/WildChat-1M
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for WildChat ## Dataset Description - **Paper:** https://arxiv.org/abs/2405.01470 - **Interactive Search Tool:** https://wildvisualizer.com ([paper](https://arxiv.org/abs/2409.03753)) - **License:** [ODC-BY](https://opendatacommons.org/licenses/by/1-0/) - **Language(s) (NLP):** multi-lingual - **Point of Contact:** [Yuntian Deng](https://yuntiandeng.com/) ### Dataset Summary WildChat is a collection of 1 million conversations between human users and ChatGPT, alongside demographic data, including state, country, hashed IP addresses, and request headers. We collected WildChat by offering online users free access to OpenAI's GPT-3.5 and GPT-4. In this version, 25.53% of the conversations come from the GPT-4 chatbot, while the rest come from the GPT-3.5 chatbot. The dataset contains a broad spectrum of user-chatbot interactions that are not previously covered by other instruction fine-tuning datasets: for example, interactions include ambiguous user requests, code-switching, topic-switching, political discussions, etc. WildChat can serve both as a dataset for instructional fine-tuning and as a valuable resource for studying user behaviors. Note that this version of the dataset only contains non-toxic user inputs/ChatGPT responses. ### Updates **2024-10-17: Content Update.** Conversations flagged by [Niloofar Mireshghallah](https://homes.cs.washington.edu/~niloofar/) and her collaborators in ["Breaking News: Case Studies of Generative AI's Use in Journalism"](https://arxiv.org/abs/2406.13706) for containing PII or sensitive information have been removed from this version of the dataset. **2024-07-22: Content Update.** All toxic conversations identified by the OpenAI Moderations API or Detoxify have been removed from this version of the dataset. **2024-06-26: License Change.** We have updated the license of WildChat to [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). This change is retroactively applied to any previous downloads under the ImpACT license. ### Full Version with Toxic Content For access to the full version of the WildChat dataset, which includes toxic conversations flagged by the OpenAI Moderations API or Detoxify, please refer to [WildChat-1M-Full](https://huggingface.co/datasets/allenai/WildChat-1M-Full). This version requires approval and justification for why toxic data is needed. ### Languages 68 languages were detected in WildChat. ### Personal and Sensitive Information The data has been de-identified with Microsoft Presidio and hand-written rules by the authors. ### Data Fields - `conversation_hash` (string): The hash of each conversation's content. This is not a unique key, as different conversations with the same content will share the same hash. For unique identifiers, use `turn_identifier` within each turn. - `model` (string): The underlying OpenAI model, such as gpt-3.5-turbo or gpt-4. - `timestamp` (timestamp): The timestamp of the last turn in the conversation in UTC. - `conversation` (list): A list of user/assistant utterances. Each utterance is a dictionary containing the `role` of the speaker (user or assistant), the `content` of the utterance, the detected `language` of the utterance, whether the content of the utterance is considered `toxic`, and whether PII has been detected and anonymized (`redacted`). For user turns, there's also the hashed IP address `hashed_ip` of the turn, the state `state` and country `country` inferred from the original IP address, and the request headers `header` (which might be useful for linking multiple conversations from the same user when used in conjunction with `hashed_ip`). For assistant turns, there's a field `timestamp` which is the time when the backend server receives the full response from ChatGPT. For both user and assistant turns, there's a unique idenifier `turn_identifier`. - `turn` (int): The number of turns in the conversation. A turn refers to one round of user-assistant interaction. - `language` (string): The language of the conversation. Note that this is the most frequently detected language in the utterances of the conversation. - `openai_moderation` (list): A list of OpenAI Moderation results. Each element in the list corresponds to one utterance in the conversation. When the content of an utterance is an empty string, the corresponding moderation reult is set to be an empty dictionary. - `detoxify_moderation` (list): A list of Detoxify results. Each element in the list corresponds to one utterance in the conversation. When the content of an utterance is an empty string, the corresponding Detoxify reult is set to be an empty dictionary. - `toxic` (bool): Whether this conversation contains any utterances considered to be toxic by either OpenAI Moderation or Detoxify. - `redacted` (bool): Whether this conversation contains any utterances in which PII is detected and anonymized. - `state` (string): The state inferred from the most common IP address in the conversation. Its value is sometimes `None` when GeoIP2 does not identify the state of an IP address. - `country` (string): The country inferred from the most common IP address in the conversation. Its value is sometimes `None` when GeoIP2 does not identify the country of an IP address. - `hashed_ip` (string): The most common hashed IP address in the conversation. - `header` (string): The request header containing information about operating system, browser versions, and accepted languages. This field might be useful for linking multiple conversations from the same user when used in conjunction with `hashed_ip`. Note that every turn in a conversation has the same header, as this is the way we linked turns into conversations. ### Empty User Inputs This dataset includes a small subset of conversations where users submitted empty inputs, sometimes leading to hallucinated responses from the assistant. This issue, first noticed by @yuchenlin, arises from the design of our Huggingface chatbot used for data collection, which did not restrict the submission of empty inputs. As a result, users could submit without entering any text, causing the assistant to generate responses without any user prompts. This occurs in a small fraction of the dataset. ### Licensing Information WildChat is now made available under the [**ODC-BY License**](https://opendatacommons.org/licenses/by/1-0/). This change is retroactively applied to any previous downloads under the ImpACT license. ### Citation Information Please consider citing [our paper](https://arxiv.org/abs/2405.01470) if you find this dataset useful: ``` @inproceedings{ zhao2024wildchat, title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild}, author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng}, booktitle={The Twelfth International Conference on Learning Representations}, year={2024}, url={https://openreview.net/forum?id=Bl8u7ZRlbM} } ``` ``` @misc{deng2024wildvisopensourcevisualizer, title={WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild}, author={Yuntian Deng and Wenting Zhao and Jack Hessel and Xiang Ren and Claire Cardie and Yejin Choi}, year={2024}, eprint={2409.03753}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2409.03753}, } ```

# WildChat 数据集卡片(Dataset Card) ## 数据集描述(Dataset Description) - **论文(Paper):** https://arxiv.org/abs/2405.01470 - **交互式搜索工具(Interactive Search Tool):** https://wildvisualizer.com ([论文(paper)](https://arxiv.org/abs/2409.03753)) - **许可证(License):** [ODC-BY](https://opendatacommons.org/licenses/by/1-0/) - **自然语言处理所用语言(Language(s) (NLP)):** 多语言 - **联系方式(Point of Contact):** [邓云天(Yuntian Deng)](https://yuntiandeng.com/) ### 数据集摘要(Dataset Summary) WildChat 是一个包含100万人类用户与ChatGPT对话的数据集,同时附带人口统计数据,包括地区(state)、国家、哈希化IP地址(hashed IP addresses)以及请求头(request headers)。我们通过向在线用户免费开放OpenAI的GPT-3.5和GPT-4访问权限来收集WildChat数据集。本版本中,25.53%的对话来自GPT-4聊天机器人,其余则来自GPT-3.5聊天机器人。该数据集涵盖了此前其他指令微调(instruction fine-tuning)数据集未覆盖的广泛用户-聊天机器人交互场景:例如模糊的用户请求、语码转换(code-switching)、话题切换、政治讨论等。WildChat 既可作为指令微调的数据集,也可作为研究用户行为的宝贵资源。请注意,本版本的数据集仅包含无毒的用户输入/ChatGPT回复。 ### 更新记录(Updates) **2024-10-17:内容更新。** 由[Niloofar Mireshghallah](https://homes.cs.washington.edu/~niloofar/)及其合作者在《突发新闻:生成式人工智能(Generative AI)在新闻业中的应用案例研究》("Breaking News: Case Studies of Generative AI's Use in Journalism",arXiv:2406.13706)中标记为包含个人可识别信息(Personally Identifiable Information,简称PII)或敏感信息的对话,已从本版本的数据集中移除。 **2024-07-22:内容更新。** 所有被OpenAI审核API(OpenAI Moderations API)或Detoxify识别为有毒的对话,已从本版本的数据集中移除。 **2024-06-26:许可证变更。** 我们已将WildChat的许可证更新为[ODC-BY](https://opendatacommons.org/licenses/by/1-0/)。此变更将追溯应用于此前以ImpACT许可证下载的所有版本。 ### 包含有毒内容的完整版本 若需获取包含被OpenAI审核API或Detoxify标记的有毒对话的WildChat完整数据集,请参阅[WildChat-1M-Full](https://huggingface.co/datasets/allenai/WildChat-1M-Full)。申请该版本数据集需要提供审批流程,并说明需要有毒数据的理由。 ### 语言 WildChat 中共检测到68种语言。 ### 个人与敏感信息 本数据集已由作者使用Microsoft Presidio工具和手写规则进行去标识化(de-identified)处理。 ### 数据字段(Data Fields) - `conversation_hash`(字符串类型):每条对话内容的哈希值。由于内容相同的不同对话会共享同一哈希值,因此该字段并非唯一标识符。如需唯一标识符,请使用每一轮对话中的`turn_identifier`。 - `model`(字符串类型):所用的OpenAI底层模型,例如gpt-3.5-turbo或gpt-4。 - `timestamp`(时间戳类型):对话最后一轮交互的UTC时间戳。 - `conversation`(列表类型):用户/助手发言的列表。每条发言为一个字典,包含发言者的角色(`role`,用户或助手)、发言内容(`content`)、检测到的发言语言(`language`)、发言内容是否被视为有毒(`toxic`),以及个人可识别信息是否已被检测并匿名化(`redacted`)。对于用户轮次的发言,还包含该轮次的哈希化IP地址`hashed_ip`、从原始IP地址推断出的地区`state`和国家`country`,以及请求头`header`(结合`hashed_ip`使用时,可用于关联同一用户的多条对话)。对于助手轮次的发言,包含字段`timestamp`,即后端服务器收到ChatGPT完整回复的时间。无论是用户还是助手的发言,均包含唯一标识符`turn_identifier`。 - `turn`(整数类型):对话的轮次数量。一轮指一次用户-助手交互循环。 - `language`(字符串类型):对话的语言。此处指该对话发言中检测到的最频繁使用的语言。 - `openai_moderation`(列表类型):OpenAI审核结果的列表。列表中的每个元素对应对话中的一条发言。若发言内容为空字符串,则对应的审核结果为空字典。 - `detoxify_moderation`(列表类型):Detoxify审核结果的列表。列表中的每个元素对应对话中的一条发言。若发言内容为空字符串,则对应的Detoxify结果为空字典。 - `toxic`(布尔类型):该对话是否包含任何被OpenAI审核或Detoxify判定为有毒的发言。 - `redacted`(布尔类型):该对话是否包含任何已被检测并匿名化处理个人可识别信息的发言。 - `state`(字符串类型):从对话中最常见的IP地址推断出的地区。当GeoIP2无法识别IP地址对应的地区时,该字段值可能为`None`。 - `country`(字符串类型):从对话中最常见的IP地址推断出的国家。当GeoIP2无法识别IP地址对应的国家时,该字段值可能为`None`。 - `hashed_ip`(字符串类型):对话中最常见的哈希化IP地址。 - `header`(字符串类型):包含操作系统、浏览器版本和接受语言等信息的请求头。结合`hashed_ip`使用时,可用于关联同一用户的多条对话。请注意,对话中的每一轮发言均使用相同的请求头,这也是我们将多条发言整合为对话的依据。 ### 空用户输入 本数据集包含一小部分用户提交空输入的对话,有时会导致助手生成幻觉式回复。该问题最初由@yuchenlin发现,源于我们用于数据收集的Huggingface聊天机器人(Huggingface chatbot)的设计缺陷:该机器人未限制空输入的提交。因此,用户可在未输入任何文本的情况下提交请求,导致助手在无用户提示的情况下生成回复。这种情况仅出现在数据集的极小一部分样本中。 ### 许可证信息 WildChat 现已采用[**ODC-BY许可证**](https://opendatacommons.org/licenses/by/1-0/)发布。此变更将追溯应用于此前以ImpACT许可证下载的所有版本。 ### 引用信息 若您认为本数据集对您的研究有所帮助,请引用我们的论文[our paper](https://arxiv.org/abs/2405.01470): @inproceedings{ zhao2024wildchat, title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild}, author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng}, booktitle={The Twelfth International Conference on Learning Representations}, year={2024}, url={https://openreview.net/forum?id=Bl8u7ZRlbM} } @misc{deng2024wildvisopensourcevisualizer, title={WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild}, author={Yuntian Deng and Wenting Zhao and Jack Hessel and Xiang Ren and Claire Cardie and Yejin Choi}, year={2024}, eprint={2409.03753}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2409.03753}, }
提供机构:
maas
创建时间:
2025-05-28
搜集汇总
数据集介绍
main_image_url
背景与挑战
背景概述
WildChat-1M是一个包含100万条人类用户与ChatGPT交互对话的数据集,其中约四分之一来自GPT-4模型,其余来自GPT-3.5。数据集涵盖多语言对话,并附带有用户地理位置和哈希IP地址等人口统计信息,经过清理移除了有毒和敏感内容,适用于指令微调和用户行为分析研究。
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作