Light-R1-SFTData

Name: Light-R1-SFTData
Creator: maas
Published: 2025-12-05 16:54:51
License: 暂无描述

魔搭社区2025-12-05 更新2025-11-03 收录

下载链接：

https://modelscope.cn/datasets/360zhinao/Light-R1-SFTData

下载链接

链接失效反馈

官方服务：

资源简介：

# Light-R1: Surpassing R1-Distill from Scratch\* with \$1000 through Curriculum SFT & DPO *\*from models without long COT* [technical report](https://arxiv.org/abs/2503.10460) [GitHub page](https://github.com/Qihoo360/Light-R1) Here are the two-stage SFT data we used to train [Light-R1-32B](https://huggingface.co/qihoo360/Light-R1-32B). Simply refer to `stage1-76k.json` and `stage2-3k.json` |Model|Trained From|Release Date|AIME24|AIME25| | ---- | ---- | ---- | ---- | ---- | |DeepSeek-R1-Distill-Llama-70B|Llama-3.3-70B-Instruct|25.1.20|70.0|54.1| |DeepSeek-R1-Distill-Qwen-32B|Qwen2.5-32B|25.1.20|72.6|54.9| |LIMO (32B)|Qwen2.5-32B-Instruct|25.2.4|56.3|47.1| |s1.1-32B|Qwen2.5-32B-Instruct|25.2.8|64.7|47.8| |OpenThinker-32B|Qwen2.5-32B-Instruct|25.2.12|66.0|50.9| | [**Light-R1-32B (ours)** 🤗](https://huggingface.co/qihoo360/Light-R1-32B) |Qwen2.5-32B-Instruct|25.3.4|**76.6**|**64.6**| While much work has been open-sourced trying to reproduce DeepSeek-R1 on models of 72B or less, **none** achieves similar performance on the hard math competition AIME24 as DeepSeek-R1-Distill-Qwen-32B's score 72.6. We introduce Light-R1-32B, which achieves 76.6 on AIME24 training from Qwen2.5-32B-Instruct. Starting from models without long COT (*from scratch* in terms of R1) and training on decontaminated math data, we distilled DeepSeek-R1 with curriculum SFT & DPO to **surpass DeepSeek-R1-Distill-Qwen-32B** on AIME24 & 25, and improved further with model merging. More importantly, besides the state-of-the-art from-scratch model Light-R1-32B, we also released on Day 1 all training datasets of our curriculum SFT & DPO and training code based on [360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory). Estimated training time on 12 x H800 machines takes no more than 6 hours --- around \$1000. We believe Light-R1 represents a practical way of training strong long COT models from scratch (from models without long COT). While we are working to further improve our models with RL, curriculum SFT & DPO facilitates more control along the pipeline and is more cost-friendly. With the rapid development of training and inference techniques, we hope to see more accessible long-COT models in the near future and Light-R1 provides a validated transparent way to train them in at least specialized domains. [WeChat Group here.](https://github.com/Qihoo360/Light-R1/blob/main/wechat-group.JPG) ## Release Details - Light-R1-32B model on [🤗 huggingface](https://huggingface.co/qihoo360/Light-R1-32B) - Curriculum [🤗SFT](https://huggingface.co/datasets/qihoo360/Light-R1-SFT) & [🤗DPO](https://huggingface.co/datasets/qihoo360/Light-R1-DPO) datasets - Training scripts based on [360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory) in [train-scripts](./train-scripts/) - Evaluation code based on [DeepScaleR](https://github.com/agentica-project/deepscaler) in [deepscaler-release](./deepscaler-release/) - along with evaluation logs of Light-R1-32B (e.g. [AIME24](https://huggingface.co/qihoo360/Light-R1-32B/blob/main/evaluation-results.aime24.json)) - all our reported scores are averaged over 64 runs; public models' scores are taken from their evaluation results and if not present, averaged over 64 runs; we found that averaging over 16 runs sometimes leads to deviation over 2-3 points across different runs - Technical report work in progress ## Inference Notes Light-R1-32B does not always think as its thinking capabilities are trained only with math data. We forced Light-R1 to think by hard-coding `<think>` in the chat template right before the model is supposed to generate output, as suggested by [DeepSeek](https://x.com/deepseek_ai/status/1890324295181824107). [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang) are suggested for inference. Light-R1-32B inherits Qwen models' chat template with `<think>` and `</think>` added as special tokens and `<think>` hard-coded to force thinking. ## Post-Training through Curriculum SFT & DPO | | AIME24 pass@1 (64 average) | AIME25 | GPQA Diamond | | --- | --- | --- | --- | | Qwen2.5-32B-Instruct | 16.6 | 13.6 | 48.8 | | DeepSeek-R1-Distill-Qwen-32B | 72.6 | 54.9 | 62.1 | | Light-R1-SFT-stage1 | 69.0 | 57.4 | 64.3 | | Light-R1-SFT-stage2 | 73.0 | 64.3 | 60.6 | | Light-R1-DPO | 75.8 | 63.4 | 61.8 | | Light-R1-32B | 76.6 | 64.6 | 61.8 | We adopted a curriculum learning approach with SFT and DPO. ### Math Data Sources Training questions are collected from public math datasets including [OpenR1-Math-220k](open-r1/OpenR1-Math-220k), [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k), [LIMO](https://huggingface.co/datasets/GAIR/LIMO), [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2), [s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1), [Omni-MATH](https://huggingface.co/datasets/KbsdJames/Omni-MATH), [hendrycks_math](https://hf-mirror.com/datasets/baber/hendrycks_math) and AIME (up to 2023). We decontaminated the questions against common Reasoning benchmarks such as AIME24/25, MATH-500 and GPQA Diamond. ### Curriculum SFT & DPO We collected responses from DeepSeek-R1 on these questions and filtered them based on verification and difficulty levels rated by sampling [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview), forming a 76k dataset for **SFT stage1**. After SFT stage1, a more difficult set, mostly filtered from the 76k dataset, was constructed with 3k data for **SFT stage2**. > This stage2 data could boost DeepSeek-R1-Distill-Qwen-32B from 72.6/54.9 to 0.779/0.675 on AIME 24/25. Then we sampled Light-R1-SFT-stage2's responses after SFT stage2, filtered correct and incorrect ones for each question and construct DPO pairs based on verification results and DeepSeek-R1's responses. **DPO**(or [NCA](https://github.com/thu-ml/Noise-Contrastive-Alignment)) is performed on top of SFT stage2 with sequence parallelism in [360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory). The above training steps are fairly fast and are estimated to finish in less than 6 hours on 12 x H800 machines, hence the estimate of \$1000. ### Model Merging Finally, we merged models of SFT-stage2, DPO and another DPO version with AIME24 score 74.7. The two DPO versions differ in that one of the data has special tokens skipped in rejected responses. Interestingly, the resulting version also exhibits improvement. We observed stepwise improvement in our approach and intermediate evaluation results of each stage are listed in the table above. On the GPQA evaluation of scientific questions we didn't train on at all, math-specialized training has led to some degree of forgetting. However, Light-R1-32B still demonstrates strong generalization ability. ## Data Decontamination We carefully evaluated data contamination of several open-sourced datasets. While certain contamination may be [inevitable during pre-training](https://x.com/DimitrisPapail/status/1888325914603516214), it is unacceptable for post-training to compare on benchmarks. MATH-500 is somewhat compromised with tens of questions that are identical or only numbers changed. AIME 24 and 25 stay intact but we have to pay special attention when we incorporate AIME data up to 2023. Light-R1-32B did thorough decontamination with exact or N-gram matching. ## License & Acknowledgements All released materials of this project follow the open-source license Apache 2.0. Our training experiments are powered by [360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory). Our evaluation scripts are based on [DeepScaleR](https://github.com/agentica-project/deepscaler) and therefore [verl](https://github.com/volcengine/veRL). Light-R1-32B is trained from [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct). Training data are collected from various public sources. ## Citation ```bibtex @misc{lightr1proj, title={Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond}, author={Liang Wen, Fenrui Xiao, Xin He, Yunke Cai, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang}, year={2025}, eprint={2503.10460}, archivePrefix={arXiv}, url={https://github.com/Qihoo360/Light-R1}, } ```

# Light-R1：通过课程式监督微调与偏好对齐从零开始训练，仅需约1000美元成本即可超越R1-Distill模型 * 注：训练基座为不具备长链式思维（Long Chain-of-Thought, COT）能力的模型 [技术报告](https://arxiv.org/abs/2503.10460) [GitHub项目页](https://github.com/Qihoo360/Light-R1) 以下为我们用于训练[Light-R1-32B](https://huggingface.co/qihoo360/Light-R1-32B)的两阶段监督微调（Supervised Fine-Tuning, SFT）数据集，可直接调用`stage1-76k.json`与`stage2-3k.json`获取。 |模型|训练基座|发布日期|AIME24得分|AIME25得分| | ---- | ---- | ---- | ---- | ---- | |DeepSeek-R1-Distill-Llama-70B|Llama-3.3-70B-Instruct|25.1.20|70.0|54.1| |DeepSeek-R1-Distill-Qwen-32B|Qwen2.5-32B|25.1.20|72.6|54.9| |LIMO (32B)|Qwen2.5-32B-Instruct|25.2.4|56.3|47.1| |s1.1-32B|Qwen2.5-32B-Instruct|25.2.8|64.7|47.8| |OpenThinker-32B|Qwen2.5-32B-Instruct|25.2.12|66.0|50.9| | **Light-R1-32B（本文提出）** 🤗|Qwen2.5-32B-Instruct|25.3.4|**76.6**|**64.6**| 尽管已有诸多开源工作尝试在720亿参数及以下的模型上复现DeepSeek-R1，但**无一**能在硬核数学竞赛数据集AIME24上达到DeepSeek-R1-Distill-Qwen-32B的72.6分水准。我们提出的Light-R1-32B模型基于Qwen2.5-32B-Instruct基座训练，在AIME24数据集上取得了76.6分的成绩。本工作从无长COT能力的模型出发（即R1任务视角下的从零开始训练），在经过数据去污染的数学数据集上，通过课程式SFT与DPO对DeepSeek-R1进行蒸馏，最终**在AIME24与AIME25数据集上超越了DeepSeek-R1-Distill-Qwen-32B**，并通过模型融合进一步提升了性能。更为重要的是，除了这款当前最优的从零训练模型Light-R1-32B之外，我们在项目上线首日即开源了课程式SFT与DPO阶段的全部训练数据集，以及基于[360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory)开发的训练代码。在12张H800显卡的服务器上，总训练时长不超过6小时，总成本约为1000美元。我们认为Light-R1为从零开始训练高性能长COT模型（基于无长COT能力的基座模型）提供了一条切实可行的路径。目前我们正通过强化学习（Reinforcement Learning, RL）进一步优化模型，而课程式SFT与DPO训练流程不仅便于全链路的可控性调整，同时具备更高的成本效益。随着训练与推理技术的快速发展，我们期待在不久的将来能看到更多易于部署的长COT模型，而Light-R1则为在至少垂直领域内训练这类模型提供了一套经过验证的透明化方案。 [微信群聊入口](https://github.com/Qihoo360/Light-R1/blob/main/wechat-group.JPG) ## 发布详情 - 模型权重：[Light-R1-32B](https://huggingface.co/qihoo360/Light-R1-32B) 已上传至🤗 Hugging Face平台 - 数据集：课程式[监督微调数据集](https://huggingface.co/datasets/qihoo360/Light-R1-SFT)与[偏好对齐数据集](https://huggingface.co/datasets/qihoo360/Light-R1-DPO) - 训练脚本：基于[360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory)开发的训练脚本存放于[train-scripts](./train-scripts/)目录 - 评估代码：基于[DeepScaleR](https://github.com/agentica-project/deepscaler)开发的评估代码存放于[deepscaler-release](./deepscaler-release/)目录 - 同时包含Light-R1-32B的评估日志（例如[AIME24数据集评估结果](https://huggingface.co/qihoo360/Light-R1-32B/blob/main/evaluation-results.aime24.json)） - 本文报告的所有模型得分均为64次推理的平均值；公开模型的得分取自其官方评估结果，若无官方结果则同样按64次推理取平均。我们发现仅对16次推理结果取平均时，不同批次的得分偏差可达2-3分。 - 技术报告仍在撰写中 ## 推理说明 Light-R1-32B并非总能自动触发链式思维，因其思维能力仅通过数学数据训练得到。我们按照[DeepSeek](https://x.com/deepseek_ai/status/1890324295181824107)的建议，通过在聊天模板中模型生成输出前硬编码`<think>`标记来强制模型启动思维过程。推荐使用[vLLM](https://github.com/vllm-project/vllm)或[SGLang](https://github.com/sgl-project/sglang)进行模型推理。Light-R1-32B继承了Qwen系列模型的聊天模板，并额外添加了`<think>`与`</think>`作为特殊标记，同时硬编码`<think>`标记以强制模型开启思维流程。 ## 基于课程式SFT与DPO的后训练流程 | | AIME24 pass@1（64次平均）| AIME25 | GPQA Diamond | | --- | --- | --- | --- | | Qwen2.5-32B-Instruct | 16.6 | 13.6 | 48.8 | | DeepSeek-R1-Distill-Qwen-32B | 72.6 | 54.9 | 62.1 | | Light-R1-SFT-stage1 | 69.0 | 57.4 | 64.3 | | Light-R1-SFT-stage2 | 73.0 | 64.3 | 60.6 | | Light-R1-DPO | 75.8 | 63.4 | 61.8 | | Light-R1-32B | 76.6 | 64.6 | 61.8 | 我们采用了结合SFT与DPO的课程式学习训练流程。 ### 数学数据集来源训练数据集源自多个公开数学数据集，包括[OpenR1-Math-220k](open-r1/OpenR1-Math-220k)、[OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)、[LIMO](https://huggingface.co/datasets/GAIR/LIMO)、[OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)、[s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1)、[Omni-MATH](https://huggingface.co/datasets/KbsdJames/Omni-MATH)、[hendrycks_math](https://hf-mirror.com/datasets/baber/hendrycks_math)以及截至2023年的AIME竞赛真题。我们针对AIME24/25、MATH-500与GPQA Diamond等主流推理基准数据集对训练数据进行了去污染处理。 ### 课程式SFT与DPO训练我们首先收集DeepSeek-R1在上述数据集上的生成结果，并通过[DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)采样评估结果的正确性与难度等级进行筛选，构建了包含76k条数据的**SFT第一阶段数据集**。完成SFT第一阶段训练后，我们从76k条数据中进一步筛选出难度更高的3k条数据，构建了**SFT第二阶段数据集**。 > 该第二阶段数据集可将DeepSeek-R1-Distill-Qwen-32B在AIME24/25上的得分从72.6/54.9提升至0.779/0.675。在完成SFT第二阶段训练后，我们采样得到Light-R1-SFT-stage2的生成结果，针对每个问题筛选出正确与错误的回答，并基于评估结果与DeepSeek-R1的生成结果构建DPO偏好对。 **DPO（或称[NCA](https://github.com/thu-ml/Noise-Contrastive-Alignment)）**在SFT第二阶段的模型基础上进行训练，训练过程通过[360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory)实现序列并行加速。上述训练流程效率极高，在12张H800显卡的服务器上总耗时不超过6小时，因此总成本约为1000美元。 ### 模型融合最后，我们将SFT第二阶段训练得到的模型、DPO训练得到的模型，以及另一款在AIME24上取得74.7分的DPO模型进行了融合。两款DPO模型的差异在于其中一款在拒绝回答样本中跳过了特殊标记的处理，令人惊喜的是，融合后的模型性能进一步得到了提升。我们的训练流程呈现出逐步性能提升的趋势，各阶段的中间评估结果已在上表中列出。在未参与训练的科学问答数据集GPQA上，针对数学领域的专项训练导致模型出现了一定程度的性能遗忘，但Light-R1-32B仍展现出较强的泛化能力。 ## 数据去污染我们对多个开源数据集的数据污染情况进行了细致评估。尽管预训练阶段的数据污染[在一定程度上难以避免](https://x.com/DimitrisPapail/status/1888325914603516214)，但在后训练阶段使用基准数据集进行评估时，数据污染是不可接受的。MATH-500数据集存在一定程度的污染，其中数十道题目与其他数据集完全一致或仅修改了数值。AIME24与AIME25数据集未受污染，但在纳入截至2023年的AIME真题时需要格外谨慎。Light-R1-32B通过精确匹配与N-gram匹配的方式完成了全面的数据去污染处理。 ## 许可与致谢本项目的所有开源材料均遵循Apache 2.0开源协议。我们的训练实验基于[360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory)开发完成。评估脚本基于[DeepScaleR](https://github.com/agentica-project/deepscaler)开发，因此也依赖[veRL](https://github.com/volcengine/veRL)库。Light-R1-32B基于[Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)基座模型训练得到，训练数据集源自多个公开数据源。 ## 引用 bibtex @misc{lightr1proj, title={"Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond"}, author={Liang Wen, Fenrui Xiao, Xin He, Yunke Cai, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang}, year={2025}, eprint={2503.10460}, archivePrefix={arXiv}, url={https://github.com/Qihoo360/Light-R1}, }

提供机构：

maas

创建时间：

2025-10-16

5,000+

优质数据集

54 个

任务类型

进入经典数据集