Nemotron-Content-Safety-Reasoning-Dataset
收藏魔搭社区2025-12-04 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/nv-community/Nemotron-Content-Safety-Reasoning-Dataset
下载链接
链接失效反馈官方服务:
资源简介:
# Nemotron Content Safety Reasoning Dataset
The Nemotron Content Safety Reasoning Dataset contains reasoning traces generated from open source reasoning models to provide justifications for labels in two existing datasets released by NVIDIA: Nemotron Content Safety Dataset V2 and CantTalkAboutThis Topic Control Dataset. The reasoning contains justifications for labels of either stand-alone user prompts engaging with an LLM or pairs of user prompts and LLM responses that are either safe/unsafe or on-topic/off-topic. This dataset has been used to train the [Nemotron-Content-Safety-Reasoning-4B](https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B) guard model.
The reasoning traces are extracted from [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528), [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B), and [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) models using specific prompting strategies and filtering as discussed in the accompanying [paper](https://arxiv.org/abs/2505.20087). The dataset includes two types of reasoning traces: full-length and reduced-length (efficient) reasoning traces for optimized inference performance.
**Language(s)**: English
### Dataset Sources
- **Paper**: [Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models](https://arxiv.org/abs/2505.20087)
- **Nemotron Content Safety Dataset V2**: [https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
- **CantTalkAboutThis Topic Control Dataset**: [https://huggingface.co/datasets/nvidia/CantTalkAboutThis-Topic-Control-Dataset](https://huggingface.co/datasets/nvidia/CantTalkAboutThis-Topic-Control-Dataset)
This dataset is ready for commercial use.
## Dataset Owner
NVIDIA Corporation
## Dataset Creation Date
November 2025
## License/Terms of Use
Use of this dataset is governed by the Creative Commons Attribution 4.0 International License (CC-BY 4.0).
## Intended Usage
### Direct Usage
This dataset is intended for training reasoning guard classifier models for LLM content safety and topic control. Broadly, the intended use cases are:
- Training reasoning-based guardrail models to detect unsafe content in LLM interactions
- Developing topic control mechanisms that maintain topical coherence in conversational AI
- Fine-tuning reasoning models for alignment tasks (e.g., Nemotron)
- Deploying reasoning guard classifiers in products like NeMo Guardrails
- Evaluating AI systems' ability to provide justified safety classifications
### Out-of-Scope Usage
This dataset should not be used for harmful, unethical, or malicious purposes. This dataset should not be used for:
- Training models to generate harmful or inappropriate content
- Bypassing content moderation systems
- Creating adversarial examples to test system vulnerabilities without proper authorization
- Developing systems that enable unsafe content generation
### Curation Rationale
The dataset addresses the need for explainable and reasoning-based guardrail models in AI safety. Traditional guardrail models provide classification decisions without justification, making it difficult to understand their decision-making process. By incorporating reasoning traces from state-of-the-art reasoning models, this dataset enables the development of guardrail systems that can:
1. Provide transparent explanations for safety classifications
2. Improve trust and interpretability in content moderation systems
3. Enable fine-tuning of reasoning models for alignment tasks
4. Support research into reasoning-based approaches for AI safety
5. Enable adaptability to custom safety policies
## Dataset Characterization
This dataset augments two existing NVIDIA datasets ([Nemotron Content Safety Dataset V2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) and [CantTalkAboutThis Topic Control Dataset](https://huggingface.co/datasets/nvidia/CantTalkAboutThis-Topic-Control-Dataset)) with reasoning traces from state-of-the-art reasoning models.
### Generator Models
Reasoning traces were generated using three open-source reasoning models:
- [**DeepSeek-R1-0528**](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)
- [**Qwen3-32B**](https://huggingface.co/Qwen/Qwen3-32B)
- [**gpt-oss-120b**](https://huggingface.co/openai/gpt-oss-120b)
These models were selected for their strong reasoning capabilities and ability to provide detailed explanations for their classification decisions.
### Generation Process
The reasoning traces were generated using specific prompting strategies as described in the [accompanying paper](https://arxiv.org/abs/2505.20087). The generation process includes:
1. **Full-length reasoning traces**: Complete reasoning chains that explain the model's thought process in detail
2. **Efficient reasoning traces**: Reduced-length versions optimized for faster inference
### Quality Filtering
To ensure dataset quality and consistency, samples were filtered based on prediction accuracy. Only samples where the reasoning model's predicted labels matched the ground truth labels from the base datasets were retained. This filtering process resulted in:
- **28,000 samples** from Nemotron Content Safety Dataset V2 (out of 30,007 training samples)
- **8,712 samples** from CantTalkAboutThis topic control dataset (out of 1,080 dialogues with multiple turns)
For each sample, the dataset includes reasoning traces from all three generator models, along with a `prediction_gt_mismatch_{generator_model}` field indicating whether each model's prediction aligns with the ground truth.
### Source Data
The dataset is created by augmenting two existing NVIDIA datasets with reasoning traces:
#### Base Dataset 1: Nemotron Content Safety Dataset V2
**[Nemotron Content Safety Dataset V2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)** contains 33,416 annotated interactions between humans and LLMs, split into 30,007 training samples, 1,445 validation samples, and 1,964 test samples.
**Data Collection Method:**
- Hybrid: Human annotations with synthetic LLM augmentations
**Original Data Sources:**
- Human-written prompts collected from the Anthropic RLHF, Do-Anything-Now DAN, and AI-assisted Red-Teaming datasets
- LLM responses generated from Mistral-7B-v0.1
- Response safety labels generated using ensemble of 3 LLMs: Mixtral-8x22B-v0.1, Mistral-NeMo-12B-Instruct, and Gemma-2-27B-it
- Additional refusal data generated using Gemma-2-27B using custom prompt-engineered deflection strategies
**Labeling Method:**
- Hybrid: Human, Synthetic
-- Human - Overall content safety labels for the conversation
-- Synthetic - Response safety label augmentation where needed
**Annotation Process:**
Quality Assurance (QA) is maintained through regular review by project leads. Two to three times per week, leads randomly select fifteen questions of every one hundred completed by three annotators for reevaluation, accounting for at least fifteen percent of the data analyzed for three-way agreement. These corrections are sent to individual annotators as audits with explanations referencing project guidelines. Mandatory virtual group training sessions are held every two weeks to ensure consistent understanding.
**Annotators:**
Throughout the three-month timespan of the Content Moderation Guardrails project, twelve annotators on average were employed at any given time. Four annotators come from Engineering backgrounds specializing in data analysis and collection, gaming, and robotics. Eight annotators have backgrounds in Creative Writing, with specialization in linguistics, research and development, and other creative arts. All annotators reside in the United States, from various ethnic and religious backgrounds, providing representation across race, age, and social status. All annotators volunteered for this work after being informed of its potentially sensitive nature.
#### Base Dataset 2: CantTalkAboutThis
**[CantTalkAboutThis](https://huggingface.co/datasets/nvidia/CantTalkAboutThis-Topic-Control-Dataset)** contains 1,080 dialogues with distractor turns across nine domains (health, banking, travel, education, finance, insurance, legal, real estate, and computer troubleshooting).
**Data Collection Method:**
- Synthetic
**Original Data Sources:**
The dataset was created using a pipeline to synthetically generate conversations and distractors, as described in the accompanying paper. This commercially-friendly version was generated using the Mixtral-8x7B-Instruct model. An evaluation dataset that is human annotated with more complex, realistic distractors is additionally provided for evaluating model performance.
## Dataset Format
Text, Json.
The dataset consists of two separate files, each corresponding to one of the base datasets with added reasoning traces:
### 1. Nemotron Content Safety Reasoning Dataset (Nemotron Content Safety Dataset V2)
This dataset contains 28,000 samples with reasoning traces for content safety classification. Each record includes the following fields:
#### Existing Fields from Nemotron Content Safety Dataset V2:
- **id**: The unique ID for the sample
- **reconstruction_id_if_redacted**: ID for reconstructed samples if content was redacted. The prompt will say "REDACTED" if an ID exists in this column for redacted samples from the Suicide Detection dataset. In this case, the ID specified matches the sample in the Suicide dataset, which needs to be separately downloaded.
- **prompt**: The first user turn
- **response**: The first assistant turn. `null` if prompt-only example. Sometimes can have an empty string `''` instead of a null, in which case the response label would be present.
- **prompt_label**: Binary safety label (safe/unsafe)
- **response_label**: Binary safety label (safe/unsafe). `null` if prompt-only example.
- **prompt_label_source**: Always human annotated
- **response_label_source**: Either human, llm_jury, or refusal_data_augmentation
- **violated_categories**: Comma-separated list of categories in order of frequency of annotations
#### New Fields Added for Reasoning Traces:
- **prompt_harm_label**: Binary harm classification for prompt (harmful/unharmful)
- **response_harm_label**: Binary harm classification for response (harmful/unharmful)
- **safety_reasoning_{generator_model}**: Full-length reasoning trace from the specified reasoning model, explaining the safety classification decision
- **efficient_reasoning_{generator_model}**: Reduced-length reasoning trace from the specified reasoning model for efficient inference
- **prediction_gt_mismatch_{generator_model}**: Boolean indicating whether the reasoning model's prediction matches the ground truth label
Note: `{generator_model}` can be one of: `deepseek_r1_0528`, `gpt_oss_120b`, or `qwen3_32b`
### 2. Topic Control Reasoning Dataset (CantTalkAboutThis)
This dataset contains 8,712 samples with reasoning traces for topic control classification. The dataset includes dialogues across nine domains: health, banking, travel, education, finance, insurance, legal, real estate, and computer troubleshooting. Each record includes the following fields:
#### Existing Fields from CantTalkAboutThis Dataset:
- **system_instruction**: The dialogue policy given to the model, usually a complex set of instructions on topics allowed and not allowed
- **conversation_history**: The full conversation history leading up to the current user message, including both the main topic and distractor turns (list of role and content pairs)
- **last_user_message**: The current user message to be classified for topic adherence
- **gt_topic_classification**: Ground truth classification (on-topic/off-topic)
#### New Fields Added for Reasoning Traces:
- **safety_reasoning_{generator_model}**: Full-length reasoning trace from the specified reasoning model, explaining the topic classification decision
- **efficient_reasoning_{generator_model}**: Reduced-length reasoning trace from the specified reasoning model for efficient inference
- **prediction_gt_mismatch_{generator_model}**: Boolean indicating whether the reasoning model's prediction matches the ground truth label
Note: `{generator_model}` can be one of: `deepseek_r1_0528`, `gpt_oss_120b`, or `qwen3_32b`
## Dataset Quantification
This dataset contains 28,000 samples, each with 14 features/attributes, with reasoning traces for content safety classification (Nemotron Content Safety Dataset V2 Safety Reasoning Dataset) and 8,712 samples, each with 7 features/attributes, with reasoning traces for topic control classification (Topic Control Reasoning Dataset).
Total data storage: Less than 250 MB.
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
### Personal and Sensitive Information
**Nemotron Content Safety Dataset V2 Safety Reasoning Dataset:**
The Nemotron Content Safety Dataset V2 portion has been carefully reviewed to remove personal information.
**Topic Control Reasoning Dataset (CantTalkAboutThis Dataset):**
The topic control portion does not contain any real-world personal or sensitive data.
### Bias, Risks, and Limitations
**Biases**:
- The reasoning traces reflect the biases and reasoning patterns of the specific models used (DeepSeek-R1, Qwen3-32B, gpt-oss).
- The base datasets (Nemotron Content Safety Dataset V2 and CantTalkAboutThis) may contain inherent biases from their curation process.
**Risks**:
- Reasoning traces may not generalize to all types of safety violations or off-topic scenarios.
- The reduced-length reasoning traces may lose important context or nuance compared to full-length traces.
- Models trained on this data may inherit reasoning patterns that don't align with all safety frameworks.
**Limitations**:
- The dataset is limited to English language content (may contain small samples in other languages in the Nemotron Content Safety Dataset V2 portion).
- Coverage is restricted to the safety categories and domains present in the base datasets.
- The reasoning quality depends on the capabilities of the source reasoning models.
**Legal Compliance**:
Users of this data are responsible for ensuring its appropriate use. The dataset should not be utilized in manners that conflict with legal and ethical standards.
## Recommendations
Users should be made aware of the risks, biases, and limitations of the dataset. When using this dataset to train guardrail models:
- Validate model performance on diverse test sets beyond this dataset
- Consider combining with other safety datasets for broader coverage
- Monitor for potential biases in reasoning patterns
- Ensure human oversight for high-stakes safety decisions
- Regularly update models as reasoning capabilities improve
## Citation
**BibTeX:**
```
@inproceedings{sreedhar-etal-2025-safety,
title = "Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models",
author = "Sreedhar, Makesh Narsimhan and
Rebedea, Traian and
Parisien, Christopher",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.1193/",
pages = "21862--21880",
ISBN = "979-8-89176-335-7",
abstract = "Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings will provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrails models in real-world systems."
}
```
## Dataset Card Authors
* Makesh Narsimhan Sreedhar
* Traian Rebedea
* Christopher Parisien
## Dataset Card Contact
* Makesh Narsimhan Sreedhar (makeshn@nvidia.com)
* Traian Rebedea (trebedea@nvidia.com)
# Nemotron内容安全推理数据集
本数据集源自NVIDIA发布的两款现有数据集——《Nemotron内容安全数据集V2》与《CantTalkAboutThis主题管控数据集》,收录由开源推理模型生成的推理轨迹,用于为数据集标签提供合理性佐证。所生成的推理轨迹可为两类样本的标签提供佐证:一是与大语言模型(Large Language Model)交互的独立用户提示,二是包含用户提示与大语言模型回复的对话对,此类样本的标签分为安全/不安全、合规/不合规两类。本数据集已用于训练[Nemotron-Content-Safety-Reasoning-4B](https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B)防护模型。
本数据集的推理轨迹源自[DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)、[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)以及[gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)三款模型,生成过程采用了配套论文[https://arxiv.org/abs/2505.20087](https://arxiv.org/abs/2505.20087)中提及的特定提示策略与筛选机制。数据集包含两类推理轨迹:为优化推理性能设计的全长度推理轨迹与精简长度(高效)推理轨迹。
**语言**:英语
### 数据集来源
- **相关论文**:[基于推理实现安全:推理式防护模型的实证研究](https://arxiv.org/abs/2505.20087)
- **Nemotron内容安全数据集V2**:[https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
- **CantTalkAboutThis主题管控数据集**:[https://huggingface.co/datasets/nvidia/CantTalkAboutThis-Topic-Control-Dataset](https://huggingface.co/datasets/nvidia/CantTalkAboutThis-Topic-Control-Dataset)
本数据集可商用。
### 数据集所有者
NVIDIA公司
### 数据集创建日期
2025年11月
### 使用许可与条款
本数据集的使用受知识共享署名4.0国际许可协议(Creative Commons Attribution 4.0 International License,CC-BY 4.0)约束。
### 预期用途
#### 直接用途
本数据集旨在训练用于大语言模型内容安全与主题管控的推理式防护分类模型,具体应用场景包括:
- 训练基于推理的防护模型,以检测大语言模型交互中的不安全内容
- 开发主题管控机制,维持对话式AI的主题连贯性
- 针对对齐任务微调推理模型(例如Nemotron系列模型)
- 在NeMo防护栏等产品中部署推理式防护分类模型
- 评估AI系统提供合理安全分类依据的能力
#### 禁止使用场景
本数据集不得用于有害、不道德或恶意用途,具体禁止场景包括:
- 训练模型生成有害或不当内容
- 绕过内容审核系统
- 创建对抗样本以测试系统漏洞(未获得授权时)
- 开发可生成不安全内容的系统
### 数据集构建逻辑
本数据集旨在满足AI安全领域对可解释、基于推理的防护模型的需求。传统防护模型仅提供分类决策而无解释依据,导致难以理解其决策逻辑。本数据集通过引入当前顶尖推理模型的推理轨迹,支持开发具备以下能力的防护系统:
1. 为安全分类提供透明的解释依据
2. 提升内容审核系统的可信度与可解释性
3. 支持针对对齐任务的推理模型微调
4. 为AI安全领域的推理式方法研究提供支撑
5. 支持适配自定义安全策略
### 数据集特征
本数据集基于NVIDIA的两款现有数据集——《Nemotron内容安全数据集V2》与《CantTalkAboutThis主题管控数据集》构建,通过添加顶尖推理模型的推理轨迹对其进行扩充。
#### 生成模型
推理轨迹由三款开源推理模型生成:
- [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)
- [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)
- [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)
上述模型因具备出色的推理能力与分类解释能力而被选中。
#### 生成流程
推理轨迹的生成采用了配套论文中提及的特定提示策略,生成流程包括:
1. **全长度推理轨迹**:完整的推理链,详细解释模型的思考过程
2. **高效推理轨迹**:精简长度的推理轨迹,针对快速推理进行优化
#### 质量筛选
为确保数据集的质量与一致性,样本基于预测准确率进行筛选。仅保留推理模型预测标签与基础数据集真实标签一致的样本。本次筛选结果如下:
- 从Nemotron内容安全数据集V2的30007条训练样本中筛选出28000条样本
- 从CantTalkAboutThis主题管控数据集的1080轮对话中筛选出8712条样本
每个样本均包含三款生成模型的推理轨迹,同时附带`prediction_gt_mismatch_{生成模型}`字段,用于标识对应模型的预测结果是否与真实标签一致。
#### 源数据集
本数据集通过为两款现有NVIDIA数据集添加推理轨迹构建而成:
##### 基础数据集1:Nemotron内容安全数据集V2
**[Nemotron内容安全数据集V2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)** 包含33416条人机与大语言模型的交互标注样本,划分为30007条训练样本、1445条验证样本与1964条测试样本。
**数据采集方式**:
混合模式:人工标注结合大语言模型合成数据增强
**原始数据来源**:
- 从Anthropic RLHF、Do-Anything-Now DAN以及AI辅助红队数据集收集的人工编写提示
- 由Mistral-7B-v0.1生成的大语言模型回复
- 由Mixtral-8x22B-v0.1、Mistral-NeMo-12B-Instruct与Gemma-2-27B-it三款模型集成生成的回复安全标签
- 采用Gemma-2-27B结合自定义提示工程偏转策略生成的额外拒答数据
**标注方式**:
混合模式:人工标注、合成数据标注
- 人工标注:对话的整体内容安全标签
- 合成数据标注:按需补充回复安全标签
**标注流程**:
通过项目负责人定期审查维持标注质量保证(QA)。每100份完成的标注样本中,负责人每周随机选取15份由三名标注员重新评估,至少覆盖15%的分析数据以确保三方一致性。修正结果将作为审计反馈给对应标注员,并附上符合项目指南的解释说明。每两周开展一次强制线上集体培训,以确保标注标准统一。
**标注人员**:
在内容审核防护栏项目的三个月周期内,平均同时雇佣12名标注员。其中4名标注员具备数据分析、采集、游戏与机器人领域的工程背景;8名标注员具备创意写作背景,专精于语言学、研发与其他创意艺术领域。所有标注员均居住在美国,来自不同种族与宗教背景,覆盖不同种族、年龄与社会阶层。所有标注员在知晓工作可能涉及敏感内容后自愿参与本次任务。
##### 基础数据集2:CantTalkAboutThis主题管控数据集
**[CantTalkAboutThis主题管控数据集](https://huggingface.co/datasets/nvidia/CantTalkAboutThis-Topic-Control-Dataset)** 包含1080轮带有干扰轮次的对话,涵盖健康、银行、旅行、教育、金融、保险、法律、房地产与计算机故障排查九个领域。
**数据采集方式**:
合成模式
**原始数据来源**:
本数据集通过配套论文中提及的对话与干扰项合成流水线构建。本次商用友好版本由Mixtral-8x7B-Instruct模型生成。此外还提供了经人工标注的、包含更复杂真实干扰项的评估数据集,用于模型性能测试。
### 数据集格式
文本、JSON格式。
本数据集包含两个独立文件,分别对应两款基础数据集并添加了推理轨迹:
#### 1. Nemotron内容安全推理数据集(Nemotron内容安全数据集V2版本)
本数据集包含28000条样本,每条样本均带有内容安全分类的推理轨迹。每条记录包含以下字段:
##### 源自Nemotron内容安全数据集V2的原有字段
- **id**:样本唯一标识符
- **reconstruction_id_if_redacted**:内容被脱敏时的重构样本ID。若自杀检测数据集中的样本被脱敏,提示中将显示"REDACTED",此时该ID与自杀检测数据集中的样本对应,需单独下载。
- **prompt**:首条用户轮次内容
- **response**:首条助手轮次内容。仅用户提示样本的该字段为`null`,部分场景下可能以空字符串`''`替代`null`,此时回复标签依然存在。
- **prompt_label**:二分类安全标签(安全/不安全)
- **response_label**:二分类安全标签(安全/不安全)。仅用户提示样本的该字段为`null`。
- **prompt_label_source**:始终为人工标注
- **response_label_source**:可选值为人工标注、llm_jury或refusal_data_augmentation
- **violated_categories**:按标注频率排序的违规类别逗号分隔列表
##### 新增的推理轨迹相关字段
- **prompt_harm_label**:用户提示的二分类伤害分类标签(有害/无害)
- **response_harm_label**:助手回复的二分类伤害分类标签(有害/无害)
- **safety_reasoning_{生成模型}**:指定推理模型生成的全长度推理轨迹,用于解释安全分类决策
- **efficient_reasoning_{生成模型}**:指定推理模型生成的精简长度推理轨迹,用于高效推理
- **prediction_gt_mismatch_{生成模型}**:布尔值,用于标识推理模型的预测结果是否与真实标签一致
注:`{生成模型}`可选值为`deepseek_r1_0528`、`gpt_oss_120b`或`qwen3_32b`
#### 2. 主题管控推理数据集(CantTalkAboutThis版本)
本数据集包含8712条样本,每条样本均带有主题管控分类的推理轨迹。数据集涵盖九个领域的对话:健康、银行、旅行、教育、金融、保险、法律、房地产与计算机故障排查。每条记录包含以下字段:
##### 源自CantTalkAboutThis数据集的原有字段
- **system_instruction**:给定给模型的对话策略,通常为包含允许与禁止话题的复杂指令集
- **conversation_history**:当前用户消息之前的完整对话历史,包含主话题与干扰轮次(角色与内容对列表)
- **last_user_message**:待分类主题合规性的当前用户消息
- **gt_topic_classification**:真实主题分类标签(合规/不合规)
##### 新增的推理轨迹相关字段
- **safety_reasoning_{生成模型}**:指定推理模型生成的全长度推理轨迹,用于解释主题分类决策
- **efficient_reasoning_{生成模型}**:指定推理模型生成的精简长度推理轨迹,用于高效推理
- **prediction_gt_mismatch_{生成模型}**:布尔值,用于标识推理模型的预测结果是否与真实标签一致
注:`{生成模型}`可选值为`deepseek_r1_0528`、`gpt_oss_120b`或`qwen3_32b`
### 数据集规模
本数据集包含28000条样本(每条样本含14个特征/属性),对应内容安全分类推理轨迹(Nemotron内容安全数据集V2安全推理数据集),以及8712条样本(每条样本含7个特征/属性),对应主题管控分类推理轨迹(主题管控推理数据集)。
总数据存储量:小于250 MB。
### 伦理考量
NVIDIA认为可信AI是一项共同责任,我们已制定相关政策与实践规范,以支持各类AI应用的开发。开发者在遵循本数据集服务条款下载或使用本数据集时,应与内部模型团队协作,确保本模型符合相关行业与应用场景的要求,并应对潜在的产品误用问题。
请通过[此处](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)提交安全漏洞或NVIDIA AI相关问题。
#### 个人与敏感信息
- **Nemotron内容安全数据集V2安全推理数据集**:本数据集的该部分已经过严格审查,已移除所有个人信息。
- **主题管控推理数据集(CantTalkAboutThis数据集)**:本数据集的该部分不包含任何真实个人或敏感数据。
#### 偏差、风险与局限性
**偏差**:
- 推理轨迹反映了所用三款模型(DeepSeek-R1、Qwen3-32B、gpt-oss)的固有偏差与推理模式。
- 基础数据集(Nemotron内容安全数据集V2与CantTalkAboutThis主题管控数据集)可能因构建过程存在固有偏差。
**风险**:
- 推理轨迹可能无法泛化至所有类型的安全违规或不合规场景。
- 相较于全长度推理轨迹,精简长度推理轨迹可能丢失重要上下文与细节。
- 基于本数据集训练的模型可能继承与部分安全框架不一致的推理模式。
**局限性**:
- 本数据集仅支持英语内容(Nemotron内容安全数据集V2部分可能包含少量其他语言样本)。
- 数据集覆盖范围仅限于基础数据集中包含的安全类别与领域。
- 推理质量取决于源推理模型的能力。
**法律合规**:
数据集使用者需确保其使用方式符合法律与伦理标准。
### 使用建议
使用者应知晓本数据集的风险、偏差与局限性。在使用本数据集训练防护模型时,建议:
- 在本数据集之外,使用多样化测试集验证模型性能
- 考虑结合其他安全数据集以扩展覆盖范围
- 监控推理模式中潜在的偏差
- 针对高风险安全决策设置人工审核环节
- 随着推理能力的提升定期更新模型
### 引用
**BibTeX格式**:
@inproceedings{sreedhar-etal-2025-safety,
title = "Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models",
author = "Sreedhar, Makesh Narsimhan and
Rebedea, Traian and
Parisien, Christopher",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.1193/",
pages = "21862--21880",
ISBN = "979-8-89176-335-7",
abstract = "Reasoning-based language models have demonstrated strong performance across various domains, with the most notable gains seen in mathematical and coding tasks. Recent research has shown that reasoning also offers significant benefits for LLM safety and guardrail applications. In this work, we conduct a comprehensive analysis of training reasoning-based guardrail models for content moderation, with an emphasis on generalization to custom safety policies at inference time. Our study focuses on two key dimensions: data efficiency and inference efficiency. On the data front, we find that reasoning-based models exhibit strong sample efficiency, achieving competitive performance with significantly fewer training examples than their non-reasoning counterparts. This unlocks the potential to repurpose the remaining data for mining high-value, difficult samples that further enhance model performance. On the inference side, we evaluate practical trade-offs by introducing reasoning budgets, examining the impact of reasoning length on latency and accuracy, and exploring dual-mode training to allow runtime control over reasoning behavior. Our findings will provide practical insights for researchers and developers to effectively and efficiently train and deploy reasoning-based guardrails models in real-world systems."
}
### 数据集卡片作者
* Makesh Narsimhan Sreedhar
* Traian Rebedea
* Christopher Parisien
### 数据集卡片联系人
* Makesh Narsimhan Sreedhar (makeshn@nvidia.com)
* Traian Rebedea (trebedea@nvidia.com)
提供机构:
maas
创建时间:
2025-12-02



