
Nemotron-Safety-30K

ModelScope community · Updated 2025-12-03 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/prithivMLmods/Nemotron-Safety-30K
Resource description:
# **Nemotron-Safety-30K**

Nemotron-Safety-30K is a modular post-training dataset designed for safety-focused model training. It was extracted and curated from the larger Llama-Nemotron-Post-Training-Dataset to cover safety-related training scenarios.

## Dataset Details

- **Size**: 31,426 rows
- **Format**: Parquet
- **License**: CC-BY-4.0
- **File Size**: 14 MB
- **Modalities**: Text
- **Libraries**: Datasets, pandas, Croissant

## Dataset Structure

The dataset contains the following columns:

- **input**: List of conversation inputs with role-based structure
- **output**: String responses with associated lengths
- **category**: Classification category (primarily "safety")
- **generator**: Model generator information (Mixtral-8x22B-Instruct-v0.1 & more)
- **license**: License information (cc-by-4.0)
- **reasoning**: String indicating reasoning approach ("on" or "off")

## Data Format

Each row follows a conversational format with role-based inputs:

```
{
  "role": "user",
  "content": "[User query or prompt]"
}
```

The outputs provide safety-oriented responses designed for training models to handle potentially sensitive or harmful content appropriately.

## Source

This dataset is derived from the comprehensive Llama-Nemotron-Post-Training-Dataset, available at:
https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset

## Usage

This dataset is intended for researchers and developers working on:

- Safety alignment in large language models
- Post-training refinement for responsible AI
- Safety evaluation and benchmarking
- Modular training approaches for specific safety scenarios

## License

This dataset is released under the following license:

- CC-BY-4.0 for the dataset content

## Citation

```bibtex
@misc{llama-nemotron-2025,
  title={Llama-Nemotron: Efficient Reasoning Models},
  author={Nvidia},
  year={2025},
  howpublished={Hugging Face Datasets},
  url={https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset}
}
```
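
## Example: Reading Rows (illustrative sketch)

The column layout described under Dataset Structure can be turned into prompt/response pairs with a few lines of Python. The snippet below is a minimal sketch, not an official loader: `SAMPLE_ROWS` is hypothetical placeholder data that only mirrors the documented columns, and the helper names `last_user_message` and `to_prompt_response_pairs` are invented for illustration. It assumes each `input` entry is a dict with `role` and `content` keys, as shown under Data Format.

```python
from typing import Any, Optional

# Hypothetical placeholder rows that mirror the documented columns.
# These are NOT real records from the dataset.
SAMPLE_ROWS: list[dict[str, Any]] = [
    {
        "input": [{"role": "user", "content": "[User query or prompt]"}],
        "output": "[Safety-oriented response]",
        "category": "safety",
        "generator": "Mixtral-8x22B-Instruct-v0.1",
        "license": "cc-by-4.0",
        "reasoning": "off",
    },
    {
        "input": [{"role": "user", "content": "[Another user prompt]"}],
        "output": "[Another safety-oriented response]",
        "category": "safety",
        "generator": "Mixtral-8x22B-Instruct-v0.1",
        "license": "cc-by-4.0",
        "reasoning": "on",
    },
]


def last_user_message(row: dict[str, Any]) -> str:
    """Return the content of the most recent user turn, or "" if none exists."""
    for turn in reversed(row.get("input", [])):
        if turn.get("role") == "user":
            return turn.get("content", "")
    return ""


def to_prompt_response_pairs(
    rows: list[dict[str, Any]], reasoning: Optional[str] = None
) -> list[dict[str, str]]:
    """Build prompt/response pairs, optionally keeping one reasoning mode.

    `reasoning` may be "on", "off", or None (keep every row).
    """
    pairs = []
    for row in rows:
        if reasoning is not None and row.get("reasoning") != reasoning:
            continue
        pairs.append({"prompt": last_user_message(row), "response": row["output"]})
    return pairs


if __name__ == "__main__":
    for pair in to_prompt_response_pairs(SAMPLE_ROWS, reasoning="off"):
        print(pair)
```

With the real data, the same helpers should work on rows read from the Parquet file (for example via pandas or the Datasets library, both listed above), provided the `input` column is first converted to plain Python lists of dicts.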

Provider:
maas
Created:
2025-06-09