nvidia/Nemotron-Cascade-SFT-Stage-1

Name: nvidia/Nemotron-Cascade-SFT-Stage-1
Creator: nvidia
Published: 2025-12-18 18:35:06
License: No description available

Hugging Face, updated 2025-12-18, indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/nvidia/Nemotron-Cascade-SFT-Stage-1

Description:

The Nemotron-Cascade-SFT-Stage-1 dataset is used for the first stage of supervised fine-tuning (SFT) in the Nemotron-Cascade project. It covers four domains (math, code, science, and general) and draws on a broad, diverse collection of data sources. The math domain uses questions from OpenMathReasoning and NuminaMath-CoT; the code domain draws prompts from OpenCodeReasoning, MagicoderEvolInstruct, and other sources; the science domain is built from prompts in Nemotron-Post-Training-Dataset-v1 and S1K; and the general domain includes questions from mmlu auxiliary train, ShareGPT, and other datasets. All responses are generated with DeepSeek-R1 and include an explicit reasoning (thinking) process. Most prompts have multiple generated responses. The dataset contains 5,436,618 samples in total: 2,668,741 math, 1,301,591 code, 295,182 science, and 1,171,104 general.
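The per-domain counts above can be turned into mixture proportions with a little arithmetic. The sketch below is illustrative only: the dictionary keys and the helper name `domain_shares` are made up for this example and are not taken from the dataset's actual schema; only the counts come from the description.

```python
# Per-domain sample counts as stated in the description above.
# The key names are illustrative labels, not confirmed dataset field names.
DOMAIN_COUNTS = {
    "math": 2_668_741,
    "code": 1_301_591,
    "science": 295_182,
    "general": 1_171_104,
}


def domain_shares(counts):
    """Return each domain's fraction of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(DOMAIN_COUNTS.values())
shares = domain_shares(DOMAIN_COUNTS)

if __name__ == "__main__":
    print(f"total: {total:,}")
    for name, share in shares.items():
        print(f"{name:8s} {share:6.1%}")
```

The counts sum to the stated total of 5,436,618, and the mixture works out to roughly 49% math, 24% code, 5% science, and 22% general.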

Provider:

nvidia
