Aegis-AI-Content-Safety-Dataset-1.0
收藏魔搭社区2026-01-06 更新2025-01-25 收录
下载链接:
https://modelscope.cn/datasets/nv-community/Aegis-AI-Content-Safety-Dataset-1.0
下载链接
链接失效反馈官方服务:
资源简介:
# 🛡️ Nemotron Content Safety Dataset V1
_Nemotron Content Safety Dataset V1_, formerly known as _Aegis AI Content Safety Dataset_, is an open-source content safety dataset (CC-BY-4.0), which adheres to Nvidia's content safety taxonomy, covering 13 critical risk categories (see [Dataset Description](#dataset-description)).
## Dataset Details
### Dataset Description
_Nemotron Content Safety Dataset V1_ is comprised of approximately `11,000` manually annotated interactions between humans and LLMs, split into `10,798` training samples and `1,199` test samples.
To curate the dataset, we use the Hugging Face version of human preference data about harmlessness from [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf). We extract only the prompts, and elicit responses from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). Mistral excels at instruction following and generates high quality responses for the content moderation categories. We use examples in the system prompt to ensure diversity by instructing Mistral to not generate similar responses. Our data comprises four different formats: user prompt only, system prompt with user prompt, single turn user prompt with Mistral response, and multi-turn user prompt with Mistral responses.
The samples are annotated with respect to the following taxonomy.
PLEASE NOTE: The samples are only annotated at the dialog level. If the dialog consists of a single prompt only, then the annotations are valid for that prompt. Otherwise, the annotations are only applicable for prompt(s) + response(s), at a partial or whole dialog level.
### Nvidia's Content Safety Taxonomy
| Harm Category |
| --- |
| Hate /Identity Hate |
| Sexual |
| Violence |
| Suicide and Self Harm |
| Threat |
| Sexual Minor |
| Guns /Illegal Weapons |
| Controlled /Regulated substances |
| Criminal Planning /Confessions |
| PII |
| Harassment |
| Profanity |
| Other |
| Needs Caution |
- **Curated by:** Shaona Ghosh, Nvidia.
- **Paper:** [AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts](https://arxiv.org/abs/2404.05993)
- **Language(s) (NLP):** English (may contain small samples in other languages).
- **License:** CC-BY-4.0
### Dataset Sources
1. We use the Hugging Face version of human preference data about harmlessness from [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf).
2. We extract only the prompts, and elicit responses from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
3. Annotation was performed by a team of `twelve` annotators along with `two` data quality assurance persons, using data teams at Nvidia.
## Uses
### Direct Use
1. To build content moderation guardrails around LLMs.
2. Aligning LLMs to generate safe reponses using SFT or DPO methods.
### Out-of-Scope Use
The data contain content that may be offensive or upsetting. Topics include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially upsetting subject matter. Please only engage with the data in accordance with your own personal risk tolerance. The data are intended for research purposes, especially research that can make models less harmful. The views expressed in the data do not reflect the views of Nvidia or any of its employees. These data are not intended for training dialogue agents as this will likely lead to harmful model behavior.
## Dataset Creation
### Annotation process
Quality Assurance (QA) is maintained by the leads of this project. Two to three times
per week, leads choose fifteen questions at random of every one hundred completed by
three annotators to reevaluate. This accounts for fifteen percent of the data analyzed
for three-way agreement at a minimum, often having at least twenty to thirty percent
analyzed to further ensure quality. These corrections are sent to each individual annotator
as audits, with brief explanations of why certain corrections were made with reference to
the project guidelines. Data sets are commonly broken into 2,000-4,000 text-based prompts
and delivered to annotators in batches of three to five. In the transitions between each
batch, Person In Charge (PIC) or the lead designate at least one full eight hour work day
for annotators to self-correct their categorization work. Many training sessions have been
held throughout the duration of this project for tips on best practices when self-correcting,
including filtering through key words in the data, referencing the regularly updated FAQ
sheet with example questions, and choosing completed prompts at random to reevaluate.
Annotators are instructed to only self-correct their own work and avoid looking at any
other annotations besides their own. Both Leads are also available at all times for further
questions or meetings to ensure consistent understanding of the material. Mandatory virtual
group training are held every two weeks or as needed depending on the circumstances of
the project. These meetings are led by leads and often utilize examples of commonly seen
discrepancies to present as learning opportunities.
#### Who are the annotators?
Throughout the three month time span of the Content Moderation Guardrails project, we
have averaged twelve annotators at any given time. Of these twelve, four annotators come
from Engineering backgrounds specializing in data analysis and collection, gaming, and
robotics. Eight annotators have a background in Creative Writing, with specialization in
linguistics, research and development, and other creative arts such as photography and
film. All annotators have been extensively trained in working with Large Language Models
(LLM), as well as other variations of Generative AI such as image retrieval or evaluations
of multi-turn conversations. All are capable of generating creative text-based output and
categorization work. Each of these twelve annotators resides in the United States, all from
various ethnic and religious backgrounds that allow for representation across race, age, and
social status.
### Personal and Sensitive Information
This dataset comprises LLM responses generated from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) . We have carefully gone through the data and taken out anything that could have personal information in it. However, there is still a chance that some personal information might be left in the data. If you come across anything in the data that you think should not be made public, please let us know right away.
## Bias, Risks, and Limitations
- Safety and Moderation: This dataset is intended to be used for building content moderations systems, or aligning LLMs to generate safe responses. By the nature of the work, the dataset contains critically unsafe content and annotations for that content. Extreme care and caution should be exercised in referring and using this dataset.
- Legal Compliance: Users of this data are responsible for ensuring its appropriate use. The dataset should not be utilized in manners that conflict with legal and ethical standards.
- Non-Identification: Users of this data agree to not attempt to determine the identity of individuals in this dataset.
## Ethical Statement
The process in which the _Nemotron Content Safety Dataset V1_ creation abides by ethical data categorization work is based within the tooling of [Label Studio](http://label-studio.nvidia.com/user/login/), an open source data labeling tool often used for Nvidia's internal projects. This tooling technology allows for large sets of data to be analyzed by individual annotators without seeing the work of their peers. This is essential in preventing bias between annotators, as well as delivering prompts to each individual with variability so that no one annotator is completing similar tasks based
on how the data was initially arranged.
Due to the serious nature of this project, annotators were asked to join on a volunteer basis
based on their skill level, availability, and willingness to expose themselves to potentially
toxic content. Before work on this project began, all participants were asked to sign an
“_Adult Content Acknowledgement_” that coincides with the organization’s existing _AntiHarassment Policy_ and _Code of Conduct_. This was to ensure that all annotators be made aware of the nature of this work, as well as the resources available to them should it affect their mental well-being. Regular 1:1 meetings were held between the leads assigned to this project and each annotator to make sure they are still comfortable with the material and are capable of continuing on with this type of work.
## Citation
**BibTeX:**
```
@article{ghosh2024aegis,
title={AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts},
author={Ghosh, Shaona and Varshney, Prasoon and Galinkin, Erick and Parisien, Christopher},
journal={arXiv preprint arXiv:2404.05993},
year={2024}
}
```
## Dataset Card Authors
Shaona Ghosh, shaonag@nvidia.com
## Dataset Card Contact
shaonag@nvidia.com
# 🛡️ Nemotron 内容安全数据集 V1
_Nemotron 内容安全数据集 V1(Nemotron Content Safety Dataset V1)_,前身为 _Aegis AI 内容安全数据集_,是一款遵循英伟达(Nvidia)内容安全分类体系的开源内容安全数据集(CC-BY-4.0),涵盖13类关键风险类别(详见[数据集说明](#dataset-description))。
## 数据集详情
### 数据集说明
_Nemotron 内容安全数据集 V1_ 包含约11000条人工标注的人机与大语言模型(Large Language Model, LLM)交互数据,划分为10798条训练样本与1199条测试样本。
为构建该数据集,我们使用了来自 [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) 的 Hugging Face 版无害性人类偏好数据。我们仅提取其中的提示词,并由 [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) 生成回复。Mistral-7B-v0.1 擅长遵循指令,能够为内容审核类别生成高质量回复。我们通过系统提示词中的示例指导 Mistral-7B-v0.1 避免生成相似回复,以确保数据多样性。本数据集包含四种格式:仅用户提示词、带系统提示词的用户提示词、单轮用户提示词搭配 Mistral 回复、多轮用户提示词搭配 Mistral 回复。
样本按照以下分类体系进行标注:
请注意:样本仅在对话层面进行标注。若对话仅包含单条提示词,则标注适用于该提示词;若对话包含多轮内容,则标注仅适用于提示词+回复组成的部分或完整对话层面。
### 英伟达内容安全分类体系
| 有害类别 |
| --- |
| 仇恨/身份仇恨(Hate /Identity Hate) |
| 色情内容(Sexual) |
| 暴力内容(Violence) |
| 自杀与自伤(Suicide and Self Harm) |
| 威胁(Threat) |
| 未成年人色情(Sexual Minor) |
| 枪支/非法武器(Guns /Illegal Weapons) |
| 受管制物质(Controlled /Regulated substances) |
| 犯罪策划/供认(Criminal Planning /Confessions) |
| 个人可识别信息(PII) |
| 骚扰(Harassment) |
| 亵渎性语言(Profanity) |
| 其他(Other) |
| 需谨慎处理(Needs Caution) |
- **数据集整理者:** Shaona Ghosh,英伟达(Nvidia)。
- **相关论文:** [AEGIS:基于大语言模型专家集成的在线自适应AI内容安全审核](https://arxiv.org/abs/2404.05993)
- **语言(自然语言处理):** 英语(可能包含少量其他语言样本)。
- **授权协议:** CC-BY-4.0
### 数据集来源
1. 我们使用了来自 [Anthropic HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) 的 Hugging Face 版无害性人类偏好数据。
2. 我们仅提取其中的提示词,并由 [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) 生成回复。
3. 标注工作由英伟达的数据团队中的12名标注人员与2名数据质量保障人员共同完成。
## 使用场景
### 直接使用场景
1. 为大语言模型搭建内容审核护栏。
2. 通过监督微调(SFT)或直接偏好优化(DPO)方法对齐大语言模型,使其生成安全回复。
### 超出范围的使用场景
本数据集包含可能具有冒犯性或令人不适的内容,主题包括但不限于歧视性语言、虐待、暴力、自伤、剥削及其他潜在令人不适的议题。请仅根据自身风险承受程度使用该数据集。本数据集仅用于研究目的,尤其是旨在降低模型危害性的研究。数据中表达的观点不代表英伟达及其雇员的立场。本数据集不应用于训练对话智能体,因为这可能导致模型产生有害行为。
## 数据集构建
### 标注流程
数据质量保障(QA)工作由本项目负责人负责。每周随机选取每百名标注人员完成的15条问题进行重新评估,该流程每周进行2至3次。这至少覆盖了15%的数据分析以确保三方一致性,通常会覆盖20%至30%的数据以进一步保障质量。这些修正结果会作为审计反馈给每位标注人员,并结合项目指南简要解释修正原因。数据集通常会被拆分为2000-4000条文本提示词,以3至5条为一批次交付给标注人员。在每批次之间,项目负责人(Person In Charge, PIC)或指定负责人会为标注人员留出至少一整个工作日的时间用于自我修正分类结果。本项目期间开展了多次培训课程,讲解自我修正的最佳实践,包括筛选数据中的关键词、参考定期更新的常见问题示例清单,以及随机选取已完成的提示词进行重新评估。标注人员仅被允许修正自身的标注结果,不得查看其他标注人员的工作。两位负责人随时待命解答疑问或召开会议,以确保所有人员对标注标准的理解一致。每两周或根据项目需要召开强制性虚拟集体培训,由负责人主持,通常会利用常见的标注差异案例作为学习素材。
#### 标注人员背景
在内容审核护栏项目的三个月周期内,我们始终保持约12名标注人员。其中4名标注人员拥有数据分析与采集、游戏及机器人领域的工程背景;8名标注人员拥有创意写作背景,专长于语言学、研发及摄影、电影等其他创意艺术领域。所有标注人员均接受过关于大语言模型(Large Language Model, LLM)及其他生成式AI(Generative AI)工具的全面培训,如图像检索或多轮对话评估。他们均能够生成创意文本并完成分类工作。这12名标注人员均居住在美国,来自不同的种族与宗教背景,能够代表不同的种族、年龄与社会阶层。
### 个人与敏感信息
本数据集包含由 [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) 生成的大语言模型回复。我们已对数据进行了仔细筛查,移除了所有可能包含个人信息的内容。但仍有可能存在少量个人信息残留。若您在数据中发现任何不应公开的内容,请立即告知我们。
## 偏差、风险与局限性
- 安全与审核:本数据集旨在用于搭建内容审核系统,或对齐大语言模型以生成安全回复。由于工作性质使然,本数据集包含极具危害性的内容及其标注。在引用和使用本数据集时,请务必极度谨慎。
- 合规性:本数据集的使用者需确保其使用方式合规。不得将本数据集用于违反法律与伦理标准的用途。
- 非身份识别:本数据集的使用者同意不得尝试识别数据中的个体身份。
## 伦理声明
_Nemotron 内容安全数据集 V1_ 的构建流程遵循伦理化数据标注工作规范,基于 [Label Studio](http://label-studio.nvidia.com/user/login/) 工具开发——这是一款常用于英伟达内部项目的开源数据标注工具。该工具允许标注人员独立分析大规模数据集,无需查看其他标注人员的工作成果,这对防止标注人员之间产生偏见至关重要,同时也能确保每个标注人员收到的提示词具有多样性,避免因数据初始排布导致标注人员重复完成相似任务。
鉴于本项目的严肃性,标注人员基于自身技能水平、可参与时间以及是否愿意接触潜在有毒内容,以志愿者身份参与项目。在项目开始前,所有参与者均签署了“成人内容知情同意书”,该文件与本组织现有的《反骚扰政策》及《行为准则》保持一致。此举旨在确保所有标注人员充分了解本项工作的性质,以及在工作影响心理健康时可获取的支持资源。项目负责人与每位标注人员定期开展一对一会议,以确认他们是否仍能适应相关内容,并能够继续开展此类工作。
## 引用方式
**BibTeX格式引用:**
@article{ghosh2024aegis,
title={AEGIS:基于大语言模型专家集成的在线自适应AI内容安全审核},
author={Ghosh, Shaona and Varshney, Prasoon and Galinkin, Erick and Parisien, Christopher},
journal={arXiv preprint arXiv:2404.05993},
year={2024}
}
## 数据集卡片作者
Shaona Ghosh, shaonag@nvidia.com
## 数据集卡片联系方式
shaonag@nvidia.com
提供机构:
maas
创建时间:
2025-01-20
搜集汇总
数据集介绍

背景与挑战
背景概述
该数据集是一个用于内容安全审核的开源数据集,包含约11,000个人工标注的LLM交互样本,涵盖13个风险类别,主要用于构建内容审核护栏和对齐LLM生成安全响应。
以上内容由遇见数据集搜集并总结生成



