Light-R1-DPOData

Name: Light-R1-DPOData
Creator: maas
Published: 2025-12-05 16:54:51
License: 暂无描述

魔搭社区2025-12-05 更新2025-12-06 收录

下载链接：

https://modelscope.cn/datasets/360zhinao/Light-R1-DPOData

下载链接

链接失效反馈

官方服务：

资源简介：

# Light-R1: Surpassing R1-Distill from Scratch\* with \$1000 through Curriculum SFT & DPO *\*from models without long COT* [technical report](https://arxiv.org/abs/2503.10460) [GitHub page](https://github.com/Qihoo360/Light-R1) Here is the DPO data we used to train [Light-R1-32B](https://huggingface.co/qihoo360/Light-R1-32B). Simply refer to `dpo-pairs.json` |Model|Trained From|Release Date|AIME24|AIME25| | ---- | ---- | ---- | ---- | ---- | |DeepSeek-R1-Distill-Llama-70B|Llama-3.3-70B-Instruct|25.1.20|70.0|54.1| |DeepSeek-R1-Distill-Qwen-32B|Qwen2.5-32B|25.1.20|72.6|54.9| |LIMO (32B)|Qwen2.5-32B-Instruct|25.2.4|56.3|47.1| |s1.1-32B|Qwen2.5-32B-Instruct|25.2.8|64.7|47.8| |OpenThinker-32B|Qwen2.5-32B-Instruct|25.2.12|66.0|50.9| | [**Light-R1-32B (ours)** 🤗](https://huggingface.co/qihoo360/Light-R1-32B) |Qwen2.5-32B-Instruct|25.3.4|**76.6**|**64.6**| While much work has been open-sourced trying to reproduce DeepSeek-R1 on models of 72B or less, **none** achieves similar performance on the hard math competition AIME24 as DeepSeek-R1-Distill-Qwen-32B's score 72.6. We introduce Light-R1-32B, which achieves 76.6 on AIME24 training from Qwen2.5-32B-Instruct. Starting from models without long COT (*from scratch* in terms of R1) and training on decontaminated math data, we distilled DeepSeek-R1 with curriculum SFT & DPO to **surpass DeepSeek-R1-Distill-Qwen-32B** on AIME24 & 25, and improved further with model merging. More importantly, besides the state-of-the-art from-scratch model Light-R1-32B, we also released on Day 1 all training datasets of our curriculum SFT & DPO and training code based on [360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory). Estimated training time on 12 x H800 machines takes no more than 6 hours --- around \$1000. We believe Light-R1 represents a practical way of training strong long COT models from scratch (from models without long COT). While we are working to further improve our models with RL, curriculum SFT & DPO facilitates more control along the pipeline and is more cost-friendly. With the rapid development of training and inference techniques, we hope to see more accessible long-COT models in the near future and Light-R1 provides a validated transparent way to train them in at least specialized domains. [WeChat Group here.](https://github.com/Qihoo360/Light-R1/blob/main/wechat-group.JPG) ## Release Details - Light-R1-32B model on [🤗 huggingface](https://huggingface.co/qihoo360/Light-R1-32B) - Curriculum [🤗SFT](https://huggingface.co/datasets/qihoo360/Light-R1-SFT) & [🤗DPO](https://huggingface.co/datasets/qihoo360/Light-R1-DPO) datasets - Training scripts based on [360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory) in [train-scripts](./train-scripts/) - Evaluation code based on [DeepScaleR](https://github.com/agentica-project/deepscaler) in [deepscaler-release](./deepscaler-release/) - along with evaluation logs of Light-R1-32B (e.g. [AIME24](https://huggingface.co/qihoo360/Light-R1-32B/blob/main/evaluation-results.aime24.json)) - all our reported scores are averaged over 64 runs; public models' scores are taken from their evaluation results and if not present, averaged over 64 runs; we found that averaging over 16 runs sometimes leads to deviation over 2-3 points across different runs - Technical report work in progress ## Inference Notes Light-R1-32B does not always think as its thinking capabilities are trained only with math data. We forced Light-R1 to think by hard-coding `<think>` in the chat template right before the model is supposed to generate output, as suggested by [DeepSeek](https://x.com/deepseek_ai/status/1890324295181824107). [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang) are suggested for inference. Light-R1-32B inherits Qwen models' chat template with `<think>` and `</think>` added as special tokens and `<think>` hard-coded to force thinking. ## Post-Training through Curriculum SFT & DPO | | AIME24 pass@1 (64 average) | AIME25 | GPQA Diamond | | --- | --- | --- | --- | | Qwen2.5-32B-Instruct | 16.6 | 13.6 | 48.8 | | DeepSeek-R1-Distill-Qwen-32B | 72.6 | 54.9 | 62.1 | | Light-R1-SFT-stage1 | 69.0 | 57.4 | 64.3 | | Light-R1-SFT-stage2 | 73.0 | 64.3 | 60.6 | | Light-R1-DPO | 75.8 | 63.4 | 61.8 | | Light-R1-32B | 76.6 | 64.6 | 61.8 | We adopted a curriculum learning approach with SFT and DPO. ### Math Data Sources Training questions are collected from public math datasets including [OpenR1-Math-220k](open-r1/OpenR1-Math-220k), [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k), [LIMO](https://huggingface.co/datasets/GAIR/LIMO), [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2), [s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1), [Omni-MATH](https://huggingface.co/datasets/KbsdJames/Omni-MATH), [hendrycks_math](https://hf-mirror.com/datasets/baber/hendrycks_math) and AIME (up to 2023). We decontaminated the questions against common Reasoning benchmarks such as AIME24/25, MATH-500 and GPQA Diamond. ### Curriculum SFT & DPO We collected responses from DeepSeek-R1 on these questions and filtered them based on verification and difficulty levels rated by sampling [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview), forming a 76k dataset for **SFT stage1**. After SFT stage1, a more difficult set, mostly filtered from the 76k dataset, was constructed with 3k data for **SFT stage2**. > This stage2 data could boost DeepSeek-R1-Distill-Qwen-32B from 72.6/54.9 to 0.779/0.675 on AIME 24/25. Then we sampled Light-R1-SFT-stage2's responses after SFT stage2, filtered correct and incorrect ones for each question and construct DPO pairs based on verification results and DeepSeek-R1's responses. **DPO**(or [NCA](https://github.com/thu-ml/Noise-Contrastive-Alignment)) is performed on top of SFT stage2 with sequence parallelism in [360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory). The above training steps are fairly fast and are estimated to finish in less than 6 hours on 12 x H800 machines, hence the estimate of \$1000. ### Model Merging Finally, we merged models of SFT-stage2, DPO and another DPO version with AIME24 score 74.7. The two DPO versions differ in that one of the data has special tokens skipped in rejected responses. Interestingly, the resulting version also exhibits improvement. We observed stepwise improvement in our approach and intermediate evaluation results of each stage are listed in the table above. On the GPQA evaluation of scientific questions we didn't train on at all, math-specialized training has led to some degree of forgetting. However, Light-R1-32B still demonstrates strong generalization ability. ## Data Decontamination We carefully evaluated data contamination of several open-sourced datasets. While certain contamination may be [inevitable during pre-training](https://x.com/DimitrisPapail/status/1888325914603516214), it is unacceptable for post-training to compare on benchmarks. MATH-500 is somewhat compromised with tens of questions that are identical or only numbers changed. AIME 24 and 25 stay intact but we have to pay special attention when we incorporate AIME data up to 2023. Light-R1-32B did thorough decontamination with exact or N-gram matching. ## License & Acknowledgements All released materials of this project follow the open-source license Apache 2.0. Our training experiments are powered by [360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory). Our evaluation scripts are based on [DeepScaleR](https://github.com/agentica-project/deepscaler) and therefore [verl](https://github.com/volcengine/veRL). Light-R1-32B is trained from [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct). Training data are collected from various public sources. The paper for this dataset can be found [here](https://huggingface.co/papers/2503.10460). ## Citation ```bibtex @misc{lightr1proj, title={Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond}, author={Liang Wen, Fenrui Xiao, Xin He, Yunke Cai, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang}, year={2025}, eprint={}, archivePrefix={}, url={https://github.com/Qihoo360/Light-R1}, } ```

# Light-R1：通过课程监督微调与偏好对齐，从零起步（训练成本仅约1000美元）超越R1-Distill* *注：训练起点为不具备长链式思维（Long Chain-of-Thought, COT）能力的模型* [技术报告](https://arxiv.org/abs/2503.10460) [GitHub仓库](https://github.com/Qihoo360/Light-R1) 以下为我们用于训练[Light-R1-32B](https://huggingface.co/qihoo360/Light-R1-32B)所用的偏好对齐（Direct Preference Optimization, DPO）数据集，可直接通过`dpo-pairs.json`获取。 | 模型名称 | 训练起点 | 发布日期 | AIME24得分 | AIME25得分 | | ---- | ---- | ---- | ---- | ---- | | DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 2025.1.20 | 70.0 | 54.1 | | DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 2025.1.20 | 72.6 | 54.9 | | LIMO (32B) | Qwen2.5-32B-Instruct | 2025.2.4 | 56.3 | 47.1 | | s1.1-32B | Qwen2.5-32B-Instruct | 2025.2.8 | 64.7 | 47.8 | | OpenThinker-32B | Qwen2.5-32B-Instruct | 2025.2.12 | 66.0 | 50.9 | | **[Light-R1-32B（本研究成果） 🤗](https://huggingface.co/qihoo360/Light-R1-32B)** | Qwen2.5-32B-Instruct | 2025.3.4 | **76.6** | **64.6** | 尽管已有诸多开源工作尝试在720亿参数及以下规模的模型上复现DeepSeek-R1，但**尚无**方法能在高难度数学竞赛数据集AIME24上达到DeepSeek-R1-Distill-Qwen-32B的72.6分这一相近水准。我们提出Light-R1-32B，该模型以Qwen2.5-32B-Instruct为训练起点，在AIME24上取得了76.6分的成绩。我们从不具备长链式思维能力的模型起步（即R1任务中的“从零开始”），并基于去污染后的数学数据集，通过课程监督微调（Supervised Fine-Tuning, SFT）与偏好对齐（DPO）对DeepSeek-R1进行蒸馏，最终在AIME24与AIME25上**超越了DeepSeek-R1-Distill-Qwen-32B**，并通过模型融合进一步提升了性能。更为重要的是，除了这款当前最优的从零起步模型Light-R1-32B之外，我们在项目上线首日即开源了课程监督微调与偏好对齐所用的全部训练数据集，以及基于[360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory)开发的训练代码。在12台H800服务器上的训练时长预计不超过6小时，总成本约为1000美元。我们认为Light-R1为从零起步（从不具备长链式思维能力的模型）训练高性能长链式思维模型提供了一条切实可行的路径。尽管我们正通过强化学习（Reinforcement Learning, RL）进一步优化模型，但课程监督微调与偏好对齐的训练流程能够更好地管控整个训练 pipeline，且成本更为可控。随着训练与推理技术的快速发展，我们期待在不久的将来能看到更多可及性更强的长链式思维模型，而Light-R1为在至少特定领域中训练这类模型提供了一套经过验证的透明化方案。 [微信群二维码](https://github.com/Qihoo360/Light-R1/blob/main/wechat-group.JPG) ## 项目发布详情 - Light-R1-32B 模型：托管于[🤗 Hugging Face](https://huggingface.co/qihoo360/Light-R1-32B) - 课程监督微调与偏好对齐数据集：托管于[🤗 SFT数据集](https://huggingface.co/datasets/qihoo360/Light-R1-SFT)与[🤗 DPO数据集](https://huggingface.co/datasets/qihoo360/Light-R1-DPO) - 基于[360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory)开发的训练脚本：位于[train-scripts](./train-scripts/)目录下 - 基于[DeepScaleR](https://github.com/agentica-project/deepscaler)开发的评估代码：位于[deepscaler-release](./deepscaler-release/)目录下 - 附带Light-R1-32B的评估日志（例如[AIME24评估结果](https://huggingface.co/qihoo360/Light-R1-32B/blob/main/evaluation-results.aime24.json)） - 本研究报告的所有得分均为64次运行的平均值；公开模型的得分取自其官方评估结果，若无则按64次运行取平均。我们发现，仅对16次运行取平均有时会导致不同运行间的得分偏差达到2-3分 - 技术报告仍在撰写中 ## 推理说明 Light-R1-32B并非总能自动触发链式思维，因为其思维能力仅通过数学数据训练得到。我们参考[DeepSeek](https://x.com/deepseek_ai/status/1890324295181824107)的建议，通过在对话模板中模型生成输出前硬编码`<think>`标记，强制Light-R1触发思维过程。推荐使用[vLLM](https://github.com/vllm-project/vllm)或[SGLang](https://github.com/sgl-project/sglang)进行推理。Light-R1-32B继承了Qwen系列模型的对话模板，并将`<think>`与`</think>`作为特殊标记添加，同时硬编码`<think>`以强制触发思维过程。 ## 基于课程监督微调与偏好对齐的后训练流程 | | AIME24 pass@1（64次运行平均） | AIME25得分 | GPQA Diamond得分 | | --- | --- | --- | --- | | Qwen2.5-32B-Instruct | 16.6 | 13.6 | 48.8 | | DeepSeek-R1-Distill-Qwen-32B | 72.6 | 54.9 | 62.1 | | Light-R1-SFT-stage1 | 69.0 | 57.4 | 64.3 | | Light-R1-SFT-stage2 | 73.0 | 64.3 | 60.6 | | Light-R1-DPO | 75.8 | 63.4 | 61.8 | | Light-R1-32B | 76.6 | 64.6 | 61.8 | 我们采用了结合监督微调与偏好对齐的课程学习训练方案。 ### 数学数据源训练问题收集自多个公开数学数据集，包括[OpenR1-Math-220k](open-r1/OpenR1-Math-220k)、[OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)、[LIMO](https://huggingface.co/datasets/GAIR/LIMO)、[OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)、[s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1)、[Omni-MATH](https://huggingface.co/datasets/KbsdJames/Omni-MATH)、[hendrycks_math](https://hf-mirror.com/datasets/baber/hendrycks_math)以及截至2023年的AIME竞赛题。我们针对AIME24/25、MATH-500与GPQA Diamond等主流推理基准数据集对训练问题进行了去污染处理。 ### 课程监督微调与偏好对齐我们收集了DeepSeek-R1对上述问题的回答，并基于验证结果与由[DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)采样评估得到的难度等级进行筛选，构建了包含7.6万条数据的**SFT第一阶段数据集**。在完成SFT第一阶段训练后，我们从7.6万条数据中进一步筛选出更具挑战性的3000条数据，构建了**SFT第二阶段数据集**。 > 该第二阶段数据集可将DeepSeek-R1-Distill-Qwen-32B在AIME24/25上的性能从72.6/54.9提升至0.779/0.675。随后，我们对完成SFT第二阶段训练的Light-R1-SFT-stage2的回答进行采样，针对每个问题筛选出正确与错误的回答，并基于验证结果与DeepSeek-R1的回答构建偏好对齐（DPO）样本对。 **偏好对齐（DPO，或称噪声对比对齐（Noise-Contrastive-Alignment, NCA））**在SFT第二阶段的模型基础上，通过[360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory)实现的序列并行方式完成训练。上述训练步骤耗时极短，在12台H800服务器上的总训练时长预计不超过6小时，因此总成本约为1000美元。 ### 模型融合最后，我们将SFT第二阶段模型、偏好对齐模型以及另一款AIME24得分为74.7的偏好对齐模型进行融合。两款偏好对齐模型的区别在于其中一款在拒绝回答中跳过了特殊标记。有趣的是，融合后的模型同样实现了性能提升。我们的训练方案实现了阶段性性能提升，各阶段的中间评估结果已在上表中列出。在我们未进行训练的科学问题数据集GPQA上，专门针对数学领域的训练导致了一定程度的性能遗忘，但Light-R1-32B仍展现出了较强的泛化能力。 ## 数据去污染我们对多个开源数据集的数据污染情况进行了仔细评估。尽管在预训练阶段，部分数据污染可能[难以避免](https://x.com/DimitrisPapail/status/1888325914603516214)，但在后训练阶段使用存在污染的数据进行基准测试是不可接受的。MATH-500数据集存在一定程度的污染，其中数十道题目与其他数据集完全一致或仅修改了数字。AIME 24和25数据集未受污染，但在纳入截至2023年的AIME数据时，我们仍需格外谨慎。Light-R1-32B通过精确匹配与N-gram匹配的方式进行了彻底的数据去污染处理。 ## 许可与致谢本项目所有开源材料均遵循Apache 2.0开源协议。本研究的训练实验基于[360-LLaMA-Factory](https://github.com/Qihoo360/360-LLaMA-Factory)开发。我们的评估脚本基于[DeepScaleR](https://github.com/agentica-project/deepscaler)开发，因此也依赖于[verl](https://github.com/volcengine/veRL)。 Light-R1-32B以[Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)为基础进行训练。训练数据收集自多个公开数据源。本项目的技术报告可在[此处](https://huggingface.co/papers/2503.10460)查阅。 ## 引用格式 bibtex @misc{lightr1proj, title={Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond}, author={Liang Wen, Fenrui Xiao, Xin He, Yunke Cai, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang}, year={2025}, eprint={}, archivePrefix={}, url={https://github.com/Qihoo360/Light-R1}, }

提供机构：

maas

创建时间：

2025-10-16

5,000+

优质数据集

54 个

任务类型

进入经典数据集