
wildguardmix

Source: ModelScope community (魔搭社区). Updated 2025-12-26. Indexed 2024-11-30.
Download link:
https://modelscope.cn/datasets/allenai/wildguardmix
Resource overview:
# Dataset Card for WildGuardMix

## Disclaimer:

The data includes examples that might be disturbing, harmful or upsetting. It includes a range of harmful topics such as discriminatory language and discussions about abuse, violence, self-harm, sexual content, and misinformation, among other high-risk categories. The main goal of this data is to advance research in building safe LLMs. It is recommended not to train an LLM exclusively on the harmful examples.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/625888c2c964680d85d70bff/N-bHkrpknEG_PWvjfds2u.png)

## Dataset Summary

WildGuardMix consists of two splits, WildGuardTrain and WildGuardTest. Here's the breakdown of WildGuardMix:

**WildGuardTrain** used to train WildGuard:

- Data size: consists of 86,759 examples, of which 48,783 are prompt-only and 37,976 contain a prompt and response.
- Data types: synthetic data (87%), in-the-wild user-LLM interactions (11%), and existing annotator-written data (2%).
- Prompt types: vanilla and adversarial prompts that cover both harmful and benign scenarios.
- Response generations: for the synthetic adversarial and vanilla prompts, we generate matched refusal and compliance responses using a suite of LLMs.
- Labels: labels for prompt harmfulness, response harmfulness, and response refusal are obtained via GPT-4.
- Data audit:
  - Filtering: responses created through open LMs are filtered by assigning labels for each of our three target tasks using GPT-4 and recategorizing items that fail to match the intended labels.
  - Human annotation: the quality of the GPT-4 labels is audited by sampling 500 items and collecting human annotations. Agreement is 92%, 82%, and 95% for the prompt harm, response harm, and refusal labels, respectively.

**WildGuardTest** used to evaluate safety classifiers.

- Data size: contains 1,725 items for prompt harm, response harm, and response refusal classification tasks. 55% are vanilla prompts and 45% are adversarial.
- Similar to WildGuardTrain, the data consists of vanilla and adversarial synthetic data and in-the-wild user-LLM interactions covering both benign and harmful scenarios.
- Labels:
  - We collect annotations from three independent annotators for each prompt-response pair on prompt harmfulness, response refusal, and response harmfulness.
- Labels quality:
  - Fleiss Kappa scores are 0.55, 0.72, and 0.50 for the three tasks, indicating moderate to substantial agreement.
  - We run a prompted GPT-4 classifier on the dataset and manually inspect items on which the output mismatches the chosen annotator label, to further audit the ground-truth labels.

Please check the paper for further details on data construction: [WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://arxiv.org/abs/2406.18495).

## Usage

```python
from datasets import load_dataset

# Load the wildguardtrain dataset
train_dataset = load_dataset("allenai/wildguardmix", "wildguardtrain")

# Load the wildguardtest dataset
test_dataset = load_dataset("allenai/wildguardmix", "wildguardtest")
```

## Dataset Details

The dataset contains the following columns:

- `prompt`: str, indicates the user request.
- `adversarial`: bool, indicates whether the prompt is adversarial or not.
- `response`: str, or None for prompt-only items in WildGuardTrain.
- `prompt_harm_label`: str ("harmful" or "unharmful"), or None for items lacking annotator agreement for `prompt_harm_label`. It is possible that other labels, such as `response_harm_label`, are not None while `prompt_harm_label` is None.
- `response_harm_label`: str ("harmful" or "unharmful"), or None for prompt-only items in WildGuardTrain and items lacking annotator agreement for `response_harm_label`. It is possible that other labels, such as `prompt_harm_label`, are not None while `response_harm_label` is None.
- `response_refusal_label`: str ("refusal" or "compliance"), or None for prompt-only items in WildGuardTrain and items lacking annotator agreement for `response_refusal_label`. It is possible that other labels, such as `prompt_harm_label`, are not None while `response_refusal_label` is None.
- `subcategory`: str, indicates the fine-grained risk category of the prompt.

Additionally, we provide the columns `prompt_harm_agreement`, `response_harm_agreement`, and `response_refusal_agreement` for WildGuardTest, which show whether each label was obtained with two-way or three-way inter-annotator agreement.

## Citation

```
@misc{wildguard2024,
      title={WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs},
      author={Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri},
      year={2024},
      eprint={2406.18495},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.18495},
}
```
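
Because each of the three label columns described in the card can be None independently of the others, it is usually safer to select rows per task rather than to drop every row that has any missing label. The following is a minimal sketch of that idea on plain dictionaries. The helper names `rows_with_label` and `is_prompt_only`, and the toy rows, are hypothetical and only for illustration; the column names are the ones listed in the card.

```python
from typing import Any, Iterable, Iterator, Mapping

# Label columns named in the dataset card.
LABEL_COLUMNS = (
    "prompt_harm_label",
    "response_harm_label",
    "response_refusal_label",
)


def rows_with_label(
    rows: Iterable[Mapping[str, Any]], label_column: str
) -> Iterator[Mapping[str, Any]]:
    """Yield only the rows whose label for one task is present (not None)."""
    if label_column not in LABEL_COLUMNS:
        raise ValueError(f"unknown label column: {label_column!r}")
    for row in rows:
        if row.get(label_column) is not None:
            yield row


def is_prompt_only(row: Mapping[str, Any]) -> bool:
    """Return True for items that carry a prompt but no response."""
    return row.get("response") is None


# Hypothetical toy rows, invented for illustration (not taken from the dataset).
toy_rows = [
    {"prompt": "p1", "response": None, "prompt_harm_label": "unharmful",
     "response_harm_label": None, "response_refusal_label": None},
    {"prompt": "p2", "response": "r2", "prompt_harm_label": None,
     "response_harm_label": "unharmful", "response_refusal_label": "compliance"},
    {"prompt": "p3", "response": "r3", "prompt_harm_label": "harmful",
     "response_harm_label": "unharmful", "response_refusal_label": "refusal"},
]

# Each task keeps its own subset: p2 is usable for response harm
# even though its prompt harm label is missing.
prompt_task_rows = list(rows_with_label(toy_rows, "prompt_harm_label"))
response_task_rows = list(rows_with_label(toy_rows, "response_harm_label"))
```

With the Hugging Face `datasets` library, the same condition can be passed to `Dataset.filter`, for example `split.filter(lambda row: row["prompt_harm_label"] is not None)`, where `split` stands for whichever split object was loaded.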

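The card reports Fleiss Kappa scores for the three-annotator WildGuardTest labels. As a rough illustration of what that statistic measures, the sketch below computes Fleiss' kappa from per-item lists of annotator labels. It is a generic textbook implementation, not the authors' evaluation code; the function name `fleiss_kappa` and the toy ratings are assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings[i] holds the labels that the raters gave to item i.
    """
    if not ratings:
        raise ValueError("need at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    per_item_agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_item_agreement += agreeing / (n_raters * (n_raters - 1))

    observed = per_item_agreement / len(ratings)
    total = len(ratings) * n_raters
    # Agreement expected by chance from the overall label frequencies.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations from three raters on four items.
toy_ratings = [
    ["harmful", "harmful", "harmful"],
    ["unharmful", "unharmful", "unharmful"],
    ["harmful", "harmful", "unharmful"],
    ["unharmful", "unharmful", "harmful"],
]
kappa = fleiss_kappa(toy_ratings)
```

On the commonly used Landis and Koch scale, values from 0.41 to 0.60 are read as moderate agreement and values from 0.61 to 0.80 as substantial agreement, which matches the card's description of its reported scores.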
Provider:
maas
Created:
2025-05-27
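Background overview
WildGuardMix is a dataset for research on the safety of large language models (LLMs). It consists of a training split and a test split, used to train and evaluate safety classifiers. The data draws on several sources and types, including synthetic data and real user interactions, and label quality is checked with GPT-4 and human annotation.
The summary above was collected and generated by 遇见数据集.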