wildguardmix
收藏魔搭社区2025-12-26 更新2024-11-30 收录
下载链接:
https://modelscope.cn/datasets/allenai/wildguardmix
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for WildGuardMix
## Disclaimer:
The data includes examples that might be disturbing, harmful or upsetting. It includes a range of harmful topics such as discriminatory language and discussions
about abuse, violence, self-harm, sexual content, misinformation among other high-risk categories. The main goal of this data is for advancing research in building safe LLMs.
It is recommended not to train a LLM exclusively on the harmful examples.

## Dataset Summary
WildGuardMix consists of two splits, WildGuardTrain and WildGuardTest. Here's the breakdown of WildGuardMix:
**WildGuardTrain** used to train WildGuard:
- Data Size: corresponds of 86,759 examples, of which 48,783 are prompt-only and 37,976 contain a prompt and response.
- Data types: synthetic data (87%), in-the-wild user-LLLM interactions (11%), and existing annotator-written data (2%).
- Prompts types: vanilla and adversarial that cover both harmful and benign scenarios.
- Response generations: for the synthetic adversarial and vanilla prompts, we generate matched refusal and compliance responses using a suite of LLMs.
- Labels: for prompt harmfulness, response harmfulness, and response refusal are obtained via GPT-4.
- Data audit:
- Filtering: to filter responses created through open LMs by assigning labels for each of our three target tasks using GPT-4 and recategorizing items that fail to match intended labels.
- Human annotation: audit the quality of GPT-4 labels by sampling 500 items and collecting human annotations. 92%, 82%, and 95% agreement of items for prompt harm, response harm, and refusal labels, respectively.
**WildGuardTest** used to evaluate safety classifiers.
- Data size: contains 1,725 items for prompt harm, response harm, and response refusal classification tasks. 55% are vanilla prompts and 45% are adversarial.
- Similar to WildGuardTrain, the data consists of vanilla and adversarial synthetic data and in-the-wild user-LLLM interactions covering both benign and harmful scenarios.
- Labels:
- collect annotations from three independent annotators for each prompt-response pair on prompt harmfulness, response refusal, and response harmfulness.
- Labels quality:
- Fleiss Kappa scores are 0.55, 0.72, and 0.50 for the three tasks, indicating moderate to substantial agreement.
- Run a prompted GPT-4 classifier on the dataset and manually inspect items on which the output mismatches the chosen annotator label, to further audit the ground-truth labels.
Please check the paper for further details on data construction: [WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://arxiv.org/abs/2406.18495).
## Usage
```python
from datasets import load_dataset
# Load the wildguardtrain dataset
dataset = load_dataset("allenai/wildguardmix", "wildguardtrain")
# Load the wildguardtest dataset
dataset = load_dataset("allenai/wildguardmix", "wildguardtest")
```
## Dataset Details
The dataset contains the following columns:
- `prompt`: str, indicates the user request.
- `adversarial`: bool, indicates whether the prompt is adversarial or not.
- `response`: str, or None for prompt-only items in WildGuardTrain.
- `prompt_harm_label`: str ("harmful" or "unharmful"), or None for items lacking annotator agreement for `prompt_harm_label`. It is possible that other labels, such as `response_harm_label`, is not None but `prompt_harm_label` is None.
- `response_harm_label`: str ("harmful" or "unharmful"), or None for prompt-only items in WildGuardTrain and items lacking annotator agreement for `response_harm_label`. It is possible that other labels, such as `prompt_harm_label`, is not None but `response_harm_label` is None.
- `response_refusal_label`: str ("refusal" or "compliance"), or None for prompt-only items in WildGuardTrain and items lacking annotator agreement for `response_refusal_label`. It is possible that other labels, such as `prompt_harm_label`, is not None but `response_refusal_label` is None.
- `subcategory`: str, indicates the fine-grained risk category of the prompt.
Additionally, we provide columns of prompt_harm_agreement, response_harm_agreement, and response_refusal_agreement for WildGuardTest which show whether each label is obtained with two-way or three-way inter-annotator agreement.
## Citation
```
@misc{wildguard2024,
title={WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs},
author={Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri},
year={2024},
eprint={2406.18495},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.18495},
}
```
# WildGuardMix 数据集卡片
## 免责声明:
本数据集包含可能令人不安、有害或引发不适的示例,涵盖歧视性语言、虐待、暴力、自残、色情内容、错误信息等各类高风险主题。本数据集的核心目标是推动面向安全大语言模型(Large Language Model, LLM)构建的相关研究。**请勿仅基于其中的有害示例对大语言模型进行训练**。

## 数据集概览
WildGuardMix包含两个拆分集:WildGuardTrain与WildGuardTest。以下是该数据集的详细构成:
**WildGuardTrain** 用于训练WildGuard模型:
- 数据规模:总计86,759条示例,其中48,783条仅包含提示词(prompt),37,976条同时包含提示词与回复。
- 数据类型:合成数据(87%)、真实场景下用户与大语言模型的交互数据(11%),以及已有的标注员撰写数据(2%)。
- 提示词类型:普通(vanilla)与对抗式两类,覆盖有害与无害场景。
- 回复生成:针对合成的对抗式与普通提示词,我们使用多款大语言模型生成匹配的拒绝回复与合规回复。
- 标签:通过GPT-4获取提示词有害性、回复有害性以及回复拒绝行为的相关标签。
- 数据审核:
- 过滤流程:针对通过开源大语言模型生成的回复,使用GPT-4为我们的三个目标任务分配标签,并重新归类不符合预期标签的条目。
- 人工标注:通过抽样500条条目并收集人工标注结果,审核GPT-4标签的质量。最终在提示词有害性、回复有害性与拒绝行为标签上,标注一致性分别达到92%、82%与95%。
**WildGuardTest** 用于评估安全分类器。
- 数据规模:包含1,725条用于提示词有害性、回复有害性与回复拒绝行为分类任务的条目。其中55%为普通提示词,45%为对抗式提示词。
- 与WildGuardTrain类似,该数据集包含普通(vanilla)与对抗式合成数据,以及覆盖无害与有害场景的真实场景用户-大语言模型交互数据。
- 标签:为每个提示词-回复对收集三名独立标注员关于提示词有害性、回复拒绝行为与回复有害性的标注结果。
- 标签质量:
- 三个任务的Fleiss Kappa系数分别为0.55、0.72与0.50,表明标注一致性处于中等至高度一致水平。
- 使用带提示的GPT-4分类器对数据集进行测试,并手动检查输出与选定的标注员标签不一致的条目,以进一步审核真实标签的质量。
如需了解数据构建的更多细节,请查阅论文:《WildGuard:面向大语言模型安全风险、越狱攻击与拒绝响应的一站式开源审核工具》(https://arxiv.org/abs/2406.18495)。
## 使用方法
python
from datasets import load_dataset
# 加载wildguardtrain数据集
dataset = load_dataset("allenai/wildguardmix", "wildguardtrain")
# 加载wildguardtest数据集
dataset = load_dataset("allenai/wildguardmix", "wildguardtest")
## 数据集详情
本数据集包含以下字段:
- `prompt`:字符串类型,表示用户请求(提示词)。
- `adversarial`:布尔类型,表示该提示词是否为对抗式提示词。
- `response`:字符串类型,对于WildGuardTrain中仅包含提示词的条目,该字段为None。
- `prompt_harm_label`:字符串类型,取值为"harmful"(有害)或"unharmful"(无害);若某条目的`prompt_harm_label`未获得标注员一致认可,则该字段为None。存在`response_harm_label`等其他字段有值但`prompt_harm_label`为None的情况。
- `response_harm_label`:字符串类型,取值为"harmful"(有害)或"unharmful"(无害);对于WildGuardTrain中仅包含提示词的条目,以及未获得标注员一致认可的条目,该字段为None。存在`prompt_harm_label`等其他字段有值但`response_harm_label`为None的情况。
- `response_refusal_label`:字符串类型,取值为"refusal"(拒绝)或"compliance"(合规);对于WildGuardTrain中仅包含提示词的条目,以及未获得标注员一致认可的条目,该字段为None。存在`prompt_harm_label`等其他字段有值但`response_refusal_label`为None的情况。
- `subcategory`:字符串类型,表示该提示词的细粒度风险类别。
此外,我们为WildGuardTest提供了`prompt_harm_agreement`、`response_harm_agreement`与`response_refusal_agreement`三个字段,用于表明每个标签是通过双向还是三方标注员一致性审核得到的。
## 引用
bibtex
@misc{wildguard2024,
title={WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs},
author={Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri},
year={2024},
eprint={2406.18495},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.18495},
}
提供机构:
maas
创建时间:
2025-05-27
搜集汇总
数据集介绍

背景与挑战
背景概述
WildGuardMix是一个专注于大型语言模型(LLM)安全性研究的数据集,包含训练集和测试集两部分,用于训练和评估安全分类器。数据集涵盖多种数据来源和类型,包括合成数据和真实用户交互数据,并通过GPT-4和人工注释确保标签质量。
以上内容由遇见数据集搜集并总结生成



