Aegis-AI-Content-Safety-Dataset-2.0

Name: Aegis-AI-Content-Safety-Dataset-2.0
Creator: maas
Published: 2025-12-04 16:21:16
License: no description provided

ModelScope community. Updated 2025-12-04; indexed 2025-01-25.

Download link:

https://modelscope.cn/datasets/nv-community/Aegis-AI-Content-Safety-Dataset-2.0


Description:

# 🛡️ Nemotron Content Safety Dataset V2

The **Nemotron Content Safety Dataset V2**, formerly known as Aegis AI Content Safety Dataset 2.0, comprises `33,416` annotated interactions between humans and LLMs, split into `30,007` training samples, `1,445` validation samples, and `1,964` test samples. This release extends the previously published [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0).

To curate the dataset, we use the HuggingFace version of the human preference data about harmlessness from [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf). We extract only the prompts and elicit responses from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). Mistral v0.1 excels at following instructions, both harmful and harmless, and thus generates high-quality responses for the content moderation categories, compared to more recent models whose safety alignment gives them higher refusal rates for harmful instructions.

The dataset adheres to a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories. It leverages a hybrid data generation pipeline that combines human annotations at the full-dialog level with a multi-LLM "jury" system to assess the safety of responses.

## Dataset Details

### Dataset Description

Our curated [training dataset](../data/llama_training_ready/) consists of a mix of data collected or generated using the following sources.

* Human-written prompts collected from the [Anthropic RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), [Do-Anything-Now DAN](https://github.com/verazuo/jailbreak_llms), and [AI-assisted Red-Teaming](https://www.kaggle.com/datasets/mpwolke/aart-ai-safety) datasets.
* LLM responses generated from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
* Response safety labels generated using an ensemble of 3 LLMs: [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1), [Mistral-NeMo-12B-Instruct](https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct), and [Gemma-2-27B-it](https://huggingface.co/google/gemma-2-27b-it).
* Refusal data, which we additionally mix in, generated by [Gemma-2-27B](https://build.nvidia.com/google/gemma-2-27b-it/modelcard.md) using custom prompt-engineered deflection strategies.
* Further data augmentation with [topic-following data](https://gitlab-master.nvidia.com/dialogue-research/topic_following/-/tree/main?ref_type=heads) generated using [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), which is found to improve the robustness and adaptability of the model to custom categories.

Additionally, to address the critical harm category of `Suicide and Self-Harm`, we have annotated about 358 samples from this [Suicide Detection dataset](https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch?select=Suicide_Detection.csv) according to our taxonomy. We do not distribute these samples as part of this dataset. For more information on how to use these particular samples, see the section on [Dataset Structure](#dataset-structure).

**Data Collection Method**

* Hybrid: Human annotations with synthetic LLM augmentations

**Labeling Method**

* Human - Overall content safety labels for the conversation
* Synthetic - Response safety label augmentation where needed

**Properties:**

The Aegis 2.0 training set contains `25,007` samples, a mix of user prompts alone, single-turn user prompt and LLM response pairs, and multi-turn user prompt and LLM response exchanges. The validation set has `1,245` samples and the test set has `1,964` samples from the above distribution.
Additionally, the training dataset also contains `5,000` refusal samples, whose responses are synthetically generated using custom prompt-engineered deflection strategies. The validation set has `200` refusal samples, and we do not augment the held-out test set with any synthetic refusal data. For training the robust guard models that we release along with this dataset, we additionally use a specific type of data called [Topic-Following](https://huggingface.co/datasets/nvidia/CantTalkAboutThis-Topic-Control-Dataset), which has approximately 8k samples and can be downloaded from this link.

- **Curated by:** NeMo Guardrails team, Nvidia.
- **Paper:** [AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails](https://openreview.net/pdf?id=0MvGCv35wi)
- **Language(s) (NLP):** English (may contain small samples in other languages).
- **License:** CC-BY-4.0

## Uses

### Direct Use

1. Building content safety moderation guardrails for LLMs. The data can be used to train both prompt and response moderation.
2. Aligning general-purpose LLMs towards safety, including careful use of the data to build safe-unsafe preference pairs.

### Out-of-Scope Use

The data contain content that may be offensive or upsetting. Topics include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially upsetting subject matter. Please only engage with the data in accordance with your own personal risk tolerance. The data are intended for research purposes, especially research that can make models less harmful. The views expressed in the data do not reflect the views of Nvidia or any of its employees. These data are not intended for training dialogue agents, as this will likely lead to harmful model behavior.

## Dataset Structure

The main partition of the dataset now contains a minimal set of 9 columns.
The column names and descriptions are as follows:

- `id`: the unique ID for the sample.
- `reconstruction_id_if_redacted`: for redacted samples, the ID of the matching sample in the [Suicide Detection dataset](https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch?select=Suicide_Detection.csv). You need to download those samples separately.
- `prompt`: the first user turn. This will say "REDACTED" if an ID exists in the `reconstruction_id_if_redacted` column, which marks redacted samples from the Suicide Detection dataset.
- `response`: the first assistant turn. `null` for a prompt-only example. It can sometimes be an empty string `''` instead of `null`, in which case the response label is still present.
- `prompt_label`: binary safety label.
- `response_label`: binary safety label. `null` for a prompt-only example.
- `prompt_label_source`: always `human` annotated.
- `response_label_source`: either `human`, `llm_jury`, or `refusal_data_augmentation`.
- `violated_categories`: comma-separated list of categories in order of frequency of annotations.

## Dataset Creation

### Annotation process

Quality Assurance (QA) is maintained by the leads of this project. Two to three times per week, leads randomly choose fifteen of every one hundred questions completed by three annotators to reevaluate. This means at least fifteen percent of the data is analyzed for three-way agreement, and often twenty to thirty percent is analyzed to further ensure quality. These corrections are sent to each individual annotator as audits, with brief explanations of why certain corrections were made with reference to the project guidelines. Data sets are commonly broken into 2,000-4,000 text-based prompts and delivered to annotators in batches of three to five. In the transitions between batches, the Person In Charge (PIC) or the lead designates at least one full eight-hour work day for annotators to self-correct their categorization work.
Many training sessions have been held throughout the duration of this project for tips on best practices when self-correcting, including filtering through key words in the data, referencing the regularly updated FAQ sheet with example questions, and choosing completed prompts at random to reevaluate. Annotators are instructed to self-correct only their own work and to avoid looking at any annotations besides their own. Both leads are also available at all times for further questions or meetings to ensure consistent understanding of the material. Mandatory virtual group training sessions are held every two weeks, or as needed depending on the circumstances of the project. These meetings are led by the leads and often use examples of commonly seen discrepancies as learning opportunities.

#### Who are the annotators?

Throughout the three-month time span of the Content Moderation Guardrails project, we have averaged twelve annotators at any given time. Of these twelve, four annotators come from Engineering backgrounds specializing in data analysis and collection, gaming, and robotics. Eight annotators have a background in Creative Writing, with specialization in linguistics, research and development, and other creative arts such as photography and film. All annotators have been extensively trained in working with Large Language Models (LLMs), as well as other variations of Generative AI such as image retrieval or evaluations of multi-turn conversations. All are capable of generating creative text-based output and categorization work. Each of these twelve annotators resides in the United States, and they come from various ethnic and religious backgrounds that allow for representation across race, age, and social status.
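The column layout described under Dataset Structure can be illustrated with a small helper that separates prompt-only rows from rows with a labeled response and flags redacted prompts. This is a minimal sketch over plain dictionaries, not an official loader: the function names, the example IDs, and the `"safe"`/`"unsafe"` label strings are assumptions made for illustration, since the card only says the labels are binary.

```python
from typing import Any, Mapping

Row = Mapping[str, Any]


def is_redacted(row: Row) -> bool:
    """True when the prompt text must be fetched from the Suicide Detection dataset."""
    return (
        row.get("reconstruction_id_if_redacted") is not None
        and row.get("prompt") == "REDACTED"
    )


def has_labeled_response(row: Row) -> bool:
    """True when the row carries a response label.

    Prompt-only rows have a null response and a null response_label. A row
    whose response is '' instead of null still counts, because its label
    is present.
    """
    return row.get("response_label") is not None


def summarize(row: Row) -> str:
    """One-line description of a row, for quick inspection."""
    kind = "prompt+response" if has_labeled_response(row) else "prompt-only"
    suffix = " (redacted prompt)" if is_redacted(row) else ""
    return f"{row['id']}: {kind}{suffix}"


# Hypothetical rows for illustration only; IDs and label strings are made up.
rows = [
    {"id": "a1", "prompt": "hello", "response": None, "prompt_label": "safe",
     "response_label": None, "reconstruction_id_if_redacted": None},
    {"id": "b2", "prompt": "REDACTED", "response": "", "prompt_label": "unsafe",
     "response_label": "safe", "reconstruction_id_if_redacted": "12345"},
]
for r in rows:
    print(summarize(r))
```

Checking `response_label` rather than `response` is a deliberate choice here: it treats the empty-string case the same way as a normal response, which matches the column description.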
## Content Safety Taxonomy

### Non-Unsafe Categories

- Safe
- Needs Caution

### Core Unsafe Categories

- Hate/Identity Hate
- Sexual
- Suicide and Self Harm
- Violence
- Guns/Illegal Weapons
- Threat
- PII/Privacy
- Sexual Minor
- Criminal Planning/Confessions
- Harassment
- Controlled/Regulated substances
- Profanity

### Fine-grained Unsafe Categories

- Illegal Activity
- Immoral/Unethical
- Unauthorized Advice
- Political/Misinformation/Conspiracy
- Fraud/Deception
- Copyright/Trademark/Plagiarism
- High Risk Gov. Decision Making
- Malware
- Manipulation

### Personal and Sensitive Information

This dataset comprises LLM responses generated from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). We have carefully gone through the data and taken out anything that could contain personal information. However, there is still a chance that some personal information remains in the data. If you come across anything in the data that you think should not be made public, please let us know right away.

## Bias, Risks, and Limitations

- Safety and Moderation: This dataset is intended to be used for building content moderation systems, or for aligning LLMs to generate safe responses. By the nature of the work, the dataset contains critically unsafe content and annotations for that content. Extreme care and caution should be exercised in referring to and using this dataset.
- Legal Compliance: Users of this data are responsible for ensuring its appropriate use. The dataset should not be utilized in manners that conflict with legal and ethical standards.
- Non-Identification: Users of this data agree to not attempt to determine the identity of individuals in this dataset.
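The taxonomy above and the comma-separated `violated_categories` column can be combined to group a sample's categories by tier. The sketch below is an illustration under assumptions: the function and tier names are hypothetical, and the exact spelling of category strings in the released files may differ from the headings in this card, so matching is case-insensitive and anything unrecognized is kept under `unknown` rather than dropped.

```python
from typing import Dict, List, Optional

NON_UNSAFE = ["Safe", "Needs Caution"]
CORE_UNSAFE = [
    "Hate/Identity Hate", "Sexual", "Suicide and Self Harm", "Violence",
    "Guns/Illegal Weapons", "Threat", "PII/Privacy", "Sexual Minor",
    "Criminal Planning/Confessions", "Harassment",
    "Controlled/Regulated substances", "Profanity",
]
FINE_GRAINED_UNSAFE = [
    "Illegal Activity", "Immoral/Unethical", "Unauthorized Advice",
    "Political/Misinformation/Conspiracy", "Fraud/Deception",
    "Copyright/Trademark/Plagiarism", "High Risk Gov. Decision Making",
    "Malware", "Manipulation",
]

# Lower-cased category name -> tier name, for case-insensitive lookup.
_TIER = {
    name.lower(): tier
    for tier, names in (
        ("non_unsafe", NON_UNSAFE),
        ("core", CORE_UNSAFE),
        ("fine_grained", FINE_GRAINED_UNSAFE),
    )
    for name in names
}


def parse_violated_categories(value: Optional[str]) -> List[str]:
    """Split the comma-separated column into a list, preserving order."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def group_by_tier(categories: List[str]) -> Dict[str, List[str]]:
    """Bucket category names by taxonomy tier; unmatched names go to 'unknown'."""
    grouped: Dict[str, List[str]] = {
        "non_unsafe": [], "core": [], "fine_grained": [], "unknown": [],
    }
    for name in categories:
        grouped[_TIER.get(name.lower(), "unknown")].append(name)
    return grouped


cats = parse_violated_categories("Violence, Malware,  Threat")
print(group_by_tier(cats))
```

Because the column is ordered by annotation frequency, the parser keeps the original order instead of returning a set.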
## Ethical Statement

The creation of the _Nemotron Content Safety Dataset V2_ abides by ethical data categorization practices and is carried out within [Label Studio](http://label-studio.nvidia.com/user/login/), an open source data labeling tool often used for Nvidia's internal projects. This tooling allows large sets of data to be analyzed by individual annotators without seeing the work of their peers. This is essential in preventing bias between annotators, as well as in delivering prompts to each individual with variability, so that no one annotator is completing similar tasks based on how the data was initially arranged.

Due to the serious nature of this project, annotators were asked to join on a volunteer basis based on their skill level, availability, and willingness to expose themselves to potentially toxic content. Before work on this project began, all participants were asked to sign an “_Adult Content Acknowledgement_” that coincides with the organization’s existing _AntiHarassment Policy_ and _Code of Conduct_. This was to ensure that all annotators were made aware of the nature of this work, as well as the resources available to them should it affect their mental well-being. Regular 1:1 meetings were held between the leads assigned to this project and each annotator to make sure they were still comfortable with the material and capable of continuing with this type of work.
## Citation

```
@inproceedings{ghosh-etal-2025-aegis2,
    title = "{AEGIS}2.0: A Diverse {AI} Safety Dataset and Risks Taxonomy for Alignment of {LLM} Guardrails",
    author = "Ghosh, Shaona and Varshney, Prasoon and Sreedhar, Makesh Narsimhan and Padmakumar, Aishwarya and Rebedea, Traian and Varghese, Jibin Rajan and Parisien, Christopher",
    editor = "Chiruzzo, Luis and Ritter, Alan and Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.306/",
    doi = "10.18653/v1/2025.naacl-long.306",
    pages = "5992--6026",
    ISBN = "979-8-89176-189-6"
}
```

Dataset Card Authors and Contacts: <br>
Shaona Ghosh, Prasoon Varshney <br>
{shaonag,prasoonv}@nvidia.com


Provider: maas
Created: 2025-01-20
