Nemotron-Safety-30K
ModelScope community. Updated 2025-12-03; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/prithivMLmods/Nemotron-Safety-30K
Description:
# **Nemotron-Safety-30K**
Nemotron-Safety-30K is a modular post-training dataset designed for safety-focused model training. It was extracted and curated from the larger Llama-Nemotron-Post-Training-Dataset to focus on safety-related training scenarios.
## Dataset Details
- **Size**: 31,426 rows
- **Format**: Parquet
- **License**: CC-BY-4.0
- **File Size**: 14 MB
- **Modalities**: Text
- **Libraries**: Datasets, pandas, Croissant
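
Since the data ships as Parquet and the card lists Datasets and pandas among the supported libraries, loading might look like the sketch below. The repository id is copied from the ModelScope download path and is assumed to match the Hugging Face Hub; the `train` split name is also an assumption, not something this card states.

```python
def load_safety_dataset(repo_id="prithivMLmods/Nemotron-Safety-30K", split="train"):
    """Load the dataset with the Hugging Face `datasets` library.

    Assumptions: the repo id mirrors the ModelScope path, and a "train"
    split exists. Adjust both if the hosted layout differs.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


def to_dataframe(dataset):
    """Convert a loaded `datasets.Dataset` into a pandas DataFrame."""
    return dataset.to_pandas()
```

Calling `to_dataframe(load_safety_dataset())` requires network access plus the `datasets` and `pandas` packages.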
## Dataset Structure
The dataset contains the following columns:
- **input**: List of conversation inputs with role-based structure
- **output**: The response text, stored as a string
- **category**: Classification category (primarily "safety")
- **generator**: The model that generated the response (Mixtral-8x22B-Instruct-v0.1, among others)
- **license**: License information (cc-by-4.0)
- **reasoning**: String indicating reasoning approach ("on" or "off")
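
The column list above can be turned into a small sanity check. This is a minimal sketch that assumes each row arrives as a plain Python dict (as when iterating a `datasets` dataset); `EXPECTED_COLUMNS`, `validate_row`, and `SAMPLE_ROW` are hypothetical names, and the sample values are placeholders rather than real data.

```python
# Column names come from the dataset card; the Python types are assumptions.
EXPECTED_COLUMNS = {
    "input": list,
    "output": str,
    "category": str,
    "generator": str,
    "license": str,
    "reasoning": str,
}


def validate_row(row):
    """Return a list of problems found in `row`; an empty list means it looks valid."""
    problems = []
    for name, expected_type in EXPECTED_COLUMNS.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")

    # The card documents only two values for the reasoning flag.
    reasoning = row.get("reasoning")
    if isinstance(reasoning, str) and reasoning not in ("on", "off"):
        problems.append("reasoning should be 'on' or 'off'")

    # Each conversation turn is expected to carry a role and its content.
    turns = row.get("input")
    if isinstance(turns, list):
        for i, turn in enumerate(turns):
            if not isinstance(turn, dict) or not {"role", "content"} <= set(turn):
                problems.append(f"input[{i}] needs 'role' and 'content' keys")
    return problems


# Placeholder row for illustration only.
SAMPLE_ROW = {
    "input": [{"role": "user", "content": "[User query or prompt]"}],
    "output": "[Safety-oriented response]",
    "category": "safety",
    "generator": "Mixtral-8x22B-Instruct-v0.1",
    "license": "cc-by-4.0",
    "reasoning": "off",
}
```

Rows read through pandas may expose `input` as an array rather than a list, in which case the type check would need loosening.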
## Data Format
Each row follows a conversational format with role-based inputs:
```
{
"role": "user",
"content": "[User query or prompt]"
}
```
The outputs provide safety-oriented responses designed for training models to handle potentially sensitive or harmful content appropriately.
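
For supervised fine-tuning, the role-based input and the string output are often merged into a single transcript. The sketch below shows one way to do that; `to_messages` is a hypothetical helper, and the `"assistant"` role label for the output is an assumption, since the card only documents the `"user"` role.

```python
def to_messages(row):
    """Build a chat transcript from one row.

    Copies the row's input turns and appends the output as a final turn.
    The "assistant" role name is an assumption, not taken from the card.
    """
    messages = [
        {"role": turn["role"], "content": turn["content"]}
        for turn in row["input"]
    ]
    messages.append({"role": "assistant", "content": row["output"]})
    return messages


# Placeholder row for illustration only.
example_row = {
    "input": [{"role": "user", "content": "[User query or prompt]"}],
    "output": "[Safety-oriented response]",
}
transcript = to_messages(example_row)
```

The resulting list can then be passed to whatever chat template the target model's tokenizer expects.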
## Source
This dataset is derived from the comprehensive Llama-Nemotron-Post-Training-Dataset available at:
https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset
## Usage
This dataset is intended for researchers and developers working on:
- Safety alignment in large language models
- Post-training refinement for responsible AI
- Safety evaluation and benchmarking
- Modular training approaches for specific safety scenarios
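
As a rough illustration of the modular use case, rows can be partitioned by the `reasoning` flag or tallied by `generator` before building a training mix. Both helper names are hypothetical, and the rows are assumed to be plain dicts with the columns described in this card.

```python
from collections import Counter, defaultdict


def split_by_reasoning(rows):
    """Group rows by their `reasoning` value (documented as "on" or "off")."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["reasoning"]].append(row)
    return dict(groups)


def generator_counts(rows):
    """Count how many rows each generator model contributed."""
    return Counter(row["generator"] for row in rows)


# Placeholder rows for illustration only; "other-model" is a made-up name.
demo_rows = [
    {"reasoning": "on", "generator": "Mixtral-8x22B-Instruct-v0.1"},
    {"reasoning": "off", "generator": "Mixtral-8x22B-Instruct-v0.1"},
    {"reasoning": "off", "generator": "other-model"},
]
by_mode = split_by_reasoning(demo_rows)
per_generator = generator_counts(demo_rows)
```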
## License
This dataset is released under the following license:
- CC-BY-4.0 for the dataset content
## Citation
```bibtex
@misc{llama-nemotron-2025,
title={Llama-Nemotron: Efficient Reasoning Models},
author={Nvidia},
year={2025},
howpublished={Hugging Face Datasets},
url={https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset}
}
```
Provider: maas
Created: 2025-06-09



