nvidia/Nemotron-Cascade-SFT-Stage-1

Source: Hugging Face. Updated 2025-12-18; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/nvidia/Nemotron-Cascade-SFT-Stage-1
Resource overview:
The Nemotron-Cascade-SFT-Stage-1 dataset is used for the first stage of supervised fine-tuning (SFT) in the Nemotron-Cascade project. It covers the math, code, science, and general domains, leveraging a broad and diverse collection of data sources. The math domain incorporates questions from OpenMathReasoning and NuminaMath-CoT; the code domain draws prompts from OpenCodeReasoning, MagicoderEvolInstruct, and others; the science domain is built using prompts from Nemotron-Post-Training-Dataset-v1 and S1K; and the general domain includes questions from mmlu auxiliary train, ShareGPT, and others. All responses are generated with DeepSeek-R1 and include explicit reasoning (thinking) processes. Multiple responses are generated for most prompts. The total number of samples is 5,436,618, distributed as follows: math 2,668,741, code 1,301,591, science 295,182, and general 1,171,104.
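As a quick sanity check on the figures above, the following minimal sketch sums the four per-domain counts and compares the result with the stated total, then derives each domain's share. The variable names (`counts`, `stated_total`, `shares`) are illustrative only and are not part of the dataset or any library.

```python
# Per-domain sample counts as stated in the overview above.
counts = {
    "math": 2_668_741,
    "code": 1_301_591,
    "science": 295_182,
    "general": 1_171_104,
}

# Total stated in the overview.
stated_total = 5_436_618

# The per-domain counts should add up to the stated total.
total = sum(counts.values())
assert total == stated_total

# Fraction of the dataset contributed by each domain.
shares = {domain: n / total for domain, n in counts.items()}

for domain, share in shares.items():
    print(f"{domain:>8}: {counts[domain]:>9,} samples ({share:.1%})")
```

The per-domain counts do add up to the stated total, and math accounts for just under half of all samples.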
Provider:
nvidia