wildguardmix

Name: wildguardmix
Creator: maas
Published: 2025-12-26 16:35:18
License: No description available

ModelScope community. Updated 2025-12-26; indexed 2024-11-30.

Download link:

https://modelscope.cn/datasets/allenai/wildguardmix

Resource description:

# Dataset Card for WildGuardMix

## Disclaimer:

The data includes examples that might be disturbing, harmful, or upsetting. It covers a range of harmful topics, such as discriminatory language and discussions of abuse, violence, self-harm, sexual content, and misinformation, among other high-risk categories. The main goal of this data is to advance research on building safe LLMs. It is recommended not to train an LLM exclusively on the harmful examples.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/625888c2c964680d85d70bff/N-bHkrpknEG_PWvjfds2u.png)

## Dataset Summary

WildGuardMix consists of two splits, WildGuardTrain and WildGuardTest. Here's the breakdown of WildGuardMix:

**WildGuardTrain**, used to train WildGuard:

- Data size: consists of 86,759 examples, of which 48,783 are prompt-only and 37,976 contain a prompt and response.
- Data types: synthetic data (87%), in-the-wild user-LLM interactions (11%), and existing annotator-written data (2%).
- Prompt types: vanilla and adversarial prompts that cover both harmful and benign scenarios.
- Response generation: for the synthetic adversarial and vanilla prompts, we generate matched refusal and compliance responses using a suite of LLMs.
- Labels: labels for prompt harmfulness, response harmfulness, and response refusal are obtained via GPT-4.
- Data audit:
  - Filtering: responses created through open LMs are filtered by assigning labels for each of our three target tasks using GPT-4 and recategorizing items that fail to match the intended labels.
  - Human annotation: we audit the quality of the GPT-4 labels by sampling 500 items and collecting human annotations. Agreement is 92%, 82%, and 95% for the prompt harm, response harm, and refusal labels, respectively.

**WildGuardTest**, used to evaluate safety classifiers:

- Data size: contains 1,725 items for prompt harm, response harm, and response refusal classification tasks. 55% are vanilla prompts and 45% are adversarial.
- Similar to WildGuardTrain, the data consists of vanilla and adversarial synthetic data and in-the-wild user-LLM interactions covering both benign and harmful scenarios.
- Labels: we collect annotations from three independent annotators for each prompt-response pair on prompt harmfulness, response refusal, and response harmfulness.
- Label quality:
  - Fleiss' kappa scores are 0.55, 0.72, and 0.50 for the three tasks, indicating moderate to substantial agreement.
  - We run a prompted GPT-4 classifier on the dataset and manually inspect items on which its output mismatches the chosen annotator label, to further audit the ground-truth labels.

Please check the paper for further details on data construction: [WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://arxiv.org/abs/2406.18495).

## Usage

```python
from datasets import load_dataset

# Load the wildguardtrain dataset
dataset = load_dataset("allenai/wildguardmix", "wildguardtrain")

# Load the wildguardtest dataset
dataset = load_dataset("allenai/wildguardmix", "wildguardtest")
```

## Dataset Details

The dataset contains the following columns:

- `prompt`: str, indicates the user request.
- `adversarial`: bool, indicates whether the prompt is adversarial or not.
- `response`: str, or None for prompt-only items in WildGuardTrain.
- `prompt_harm_label`: str ("harmful" or "unharmful"), or None for items lacking annotator agreement for `prompt_harm_label`. Other labels, such as `response_harm_label`, may be set even when `prompt_harm_label` is None.
- `response_harm_label`: str ("harmful" or "unharmful"), or None for prompt-only items in WildGuardTrain and items lacking annotator agreement for `response_harm_label`. Other labels, such as `prompt_harm_label`, may be set even when `response_harm_label` is None.
- `response_refusal_label`: str ("refusal" or "compliance"), or None for prompt-only items in WildGuardTrain and items lacking annotator agreement for `response_refusal_label`. Other labels, such as `prompt_harm_label`, may be set even when `response_refusal_label` is None.
- `subcategory`: str, indicates the fine-grained risk category of the prompt.

Additionally, for WildGuardTest we provide the columns `prompt_harm_agreement`, `response_harm_agreement`, and `response_refusal_agreement`, which show whether each label was obtained with two-way or three-way inter-annotator agreement.

## Citation

```
@misc{wildguard2024,
      title={WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs},
      author={Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri},
      year={2024},
      eprint={2406.18495},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.18495},
}
```
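
Because each of the three label columns can be None independently, per-task code usually has to drop unlabeled rows one task at a time. The following is a minimal sketch of that, not part of the original card. The helper names `keep_labeled` and `label_counts` and the placeholder rows in `toy_rows` are hypothetical; only the column names and label values come from the card. The helpers take any iterable of dict-like rows, which is assumed to match what iterating over a loaded split yields.

```python
from collections import Counter


def keep_labeled(rows, label_column):
    """Return only the rows whose `label_column` is set (not None)."""
    return [row for row in rows if row.get(label_column) is not None]


def label_counts(rows, label_column):
    """Count label values for one task, ignoring rows without a label."""
    return Counter(row[label_column] for row in keep_labeled(rows, label_column))


# Hypothetical placeholder rows that mirror the documented columns.
toy_rows = [
    {   # prompt + response, fully labeled
        "prompt": "<benign prompt>", "adversarial": False,
        "response": "<model answer>",
        "prompt_harm_label": "unharmful",
        "response_harm_label": "unharmful",
        "response_refusal_label": "compliance",
    },
    {   # prompt-only item: response and both response labels are None
        "prompt": "<harmful prompt>", "adversarial": True,
        "response": None,
        "prompt_harm_label": "harmful",
        "response_harm_label": None,
        "response_refusal_label": None,
    },
    {   # prompt label missing, response labels present
        "prompt": "<ambiguous prompt>", "adversarial": False,
        "response": "<model refusal>",
        "prompt_harm_label": None,
        "response_harm_label": "unharmful",
        "response_refusal_label": "refusal",
    },
]

prompt_task_rows = keep_labeled(toy_rows, "prompt_harm_label")
refusal_counts = label_counts(toy_rows, "response_refusal_label")
```

Filtering per column rather than requiring all three labels at once keeps rows such as the third one, which is usable for the response tasks even though its prompt label is missing.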

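The WildGuardTest label quality is reported as Fleiss' kappa over three annotators. As a reminder of what that statistic measures, here is a minimal sketch of the standard Fleiss' kappa formula. It is not the authors' evaluation code; the function name `fleiss_kappa` and the toy count matrix are hypothetical. Each row of `counts` is one item, each column one category, and each entry the number of annotators who chose that category.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix of per-item category counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if all ratings fall in one category.
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 4 items, 3 raters, 2 categories (e.g. harmful / unharmful).
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
```

In the toy matrix, two items have unanimous labels and two have a 2-to-1 split, which gives an observed agreement of 2/3 against a chance agreement of 1/2.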
Provider:

maas

Created:

2025-05-27
