WildChat-nontoxic
收藏魔搭社区2025-09-01 更新2025-06-07 收录
下载链接:
https://modelscope.cn/datasets/allenai/WildChat-nontoxic
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for WildChat-nontoxic
## Note: a newer version with 1 million conversations and demographic information can be found [here](https://huggingface.co/datasets/allenai/WildChat-1M).
## Dataset Description
- **Paper:** https://wenting-zhao.github.io/papers/wildchat.pdf
- **License:** https://allenai.org/licenses/impact-lr
- **Language(s) (NLP):** multi-lingual
- **Point of Contact:** [Yuntian Deng](mailto:yuntiand@allenai.org)
### Dataset Summary
WildChat-nontoxic is the nontoxic subset of the [WildChat dataset](https://huggingface.co/datasets/allenai/WildChat), a collection of 530K conversations between human users and ChatGPT. The full WildChat dataset containing 650K conversations can be found [here](https://huggingface.co/datasets/allenai/WildChat). We collected WildChat by offering online users free access to OpenAI's GPT-3.5-Turbo and GPT-4. The dataset contains a broad spectrum of user-chatbot interactions that are not previously covered by other instruction fine-tuning datasets: for example, interactions include ambiguous user requests, code-switching, topic-switching, political discussions, etc. WildChat can serve both as a dataset for instructional fine-tuning and as a valuable resource for studying user behaviors.
WildChat-nontoxic has been openly released under AI2's ImpACT license as a low-risk artifact. The use of WildChat-nontoxic to cause harm is strictly prohibited.
### Languages
66 languages were detected in WildChat.
### Data Fields
- `conversation_id` (string): Each conversation has a unique id.
- `model` (string): The underlying OpenAI model, such as gpt-3.5-turbo or gpt-4.
- `timestamp` (timestamp): The timestamp of the last turn in the conversation in UTC.
- `conversation` (list): A list of user/assistant utterances. Each utterance is a dictionary containing the `role` of the speaker (user or assistant), the `content` of the utterance, the detected `language` of the utterance, whether the content of the utterance is considered `toxic`, and whether PII has been detected and anonymized (`redacted`).
- `turn` (int): The number of turns in the conversation. A turn refers to one round of user-assistant interaction.
- `language` (string): The language of the conversation. Note that this is the most frequently detected language in the utterances of the conversation.
- `openai_moderation` (list): A list of OpenAI Moderation results. Each element in the list corresponds to one utterance in the conversation.
- `detoxify_moderation` (list): A list of Detoxify results. Each element in the list corresponds to one utterance in the conversation.
- `toxic` (bool): Whether this conversation contains any utterances considered to be toxic by either OpenAI Moderation or Detoxify.
- `redacted` (bool): Whether this conversation contains any utterances in which PII is detected and anonymized.
### Personal and Sensitive Information
The data has been de-identified with Microsoft Presidio and hand-written rules by the authors.
### Inappropriate Content
If you discover inappropriate conversations in this nontoxic subset, please report their conversation ids to us for removal by sending us an email or using community discussions.
### Licensing Information
WildChat-nontoxic is made available under the [**AI2
ImpACT License - Low Risk Artifacts ("LR
Agreement")**](https://allenai.org/licenses/impact-lr)
### Citation Information
Please consider citing [our paper](https://arxiv.org/abs/2405.01470) if you find this dataset useful:
```
@inproceedings{
zhao2024wildchat,
title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild},
author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=Bl8u7ZRlbM}
}
```
# 数据集卡片:WildChat-nontoxic
## 注意:包含100万条对话及人口统计信息的新版本数据集可在[此处](https://huggingface.co/datasets/allenai/WildChat-1M)获取。
## 数据集说明
- **关联论文:** https://wenting-zhao.github.io/papers/wildchat.pdf
- **授权协议:** https://allenai.org/licenses/impact-lr
- **自然语言处理适用语言:** 多语言
- **联系方式:** [邓云天](mailto:yuntiand@allenai.org)
### 数据集概述
WildChat-nontoxic是[WildChat数据集](https://huggingface.co/datasets/allenai/WildChat)的无毒子集,原始WildChat数据集包含53万条人类用户与ChatGPT的对话。完整的WildChat数据集(含65万条对话)可在[此处](https://huggingface.co/datasets/allenai/WildChat)获取。本数据集通过向在线用户免费开放OpenAI的GPT-3.5-Turbo与GPT-4接口收集得到。其涵盖了此前其他指令微调数据集未覆盖的多样化人机交互场景,例如模糊用户请求、代码转换、话题切换、政治讨论等。WildChat既可作为指令微调数据集使用,也可作为研究用户行为的宝贵资源。
WildChat-nontoxic已通过AI2的ImpACT许可作为低风险制品公开发布。严禁使用WildChat-nontoxic制造危害。
### 语言覆盖
原始WildChat数据集中共检测到66种语言。
### 数据字段
- `conversation_id`(字符串类型):每条对话对应唯一标识符。
- `model`(字符串类型):所用的OpenAI底层模型,例如gpt-3.5-turbo或gpt-4。
- `timestamp`(时间戳类型):对话最后一轮交互的UTC时间戳。
- `conversation`(列表类型):用户/助手的发言序列。每条发言为一个字典,包含发言者角色(`role`,用户或助手)、发言内容(`content`)、检测到的发言语言(`language`)、发言内容是否被判定为有害(`toxic`),以及是否检测到个人可识别信息(PII)并完成匿名化(`redacted`)。
- `turn`(整数类型):对话的交互轮次。一轮交互指一次完整的用户-助手问答循环。
- `language`(字符串类型):对话的主导语言。注:该值为对话中出现频次最高的发言语言。
- `openai_moderation`(列表类型):OpenAI内容审核结果列表。列表中每个元素对应对话中的一条发言。
- `detoxify_moderation`(列表类型):Detoxify内容审核结果列表。列表中每个元素对应对话中的一条发言。
- `toxic`(布尔类型):标记该对话是否存在任意一条被OpenAI审核或Detoxify判定为有害的发言。
- `redacted`(布尔类型):标记该对话是否存在任意一条检测到个人可识别信息并完成匿名化的发言。
### 个人与敏感信息
本数据集已由作者通过Microsoft Presidio工具与手工规则完成去标识化处理。
### 不当内容处理
若您在此无毒子集中发现不当对话,请通过发送邮件或参与社区讨论的方式将对应对话ID告知我们,以便我们移除相关内容。
### 许可协议说明
WildChat-nontoxic依据[**AI2 ImpACT低风险制品许可协议("LR协议")**](https://allenai.org/licenses/impact-lr)公开发布。
### 引用信息
若您认为本数据集对您的研究有所帮助,请引用[我们的论文](https://arxiv.org/abs/2405.01470):
@inproceedings{
zhao2024wildchat,
title={WildChat: 1M Chat{GPT} Interaction Logs in the Wild},
author={Wenting Zhao and Xiang Ren and Jack Hessel and Claire Cardie and Yejin Choi and Yuntian Deng},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=Bl8u7ZRlbM}
}
提供机构:
maas
创建时间:
2025-05-27



