LIMR
Source: ModelScope community (updated 2026-01-06, indexed 2025-06-14)
Download link: https://modelscope.cn/datasets/GAIR/LIMR
Resource description:
<div align="center">
# LIMR: Less is More for RL Scaling
</div>
<p align="center">
📄 <a href="https://github.com/GAIR-NLP/LIMR/blob/master/limr.pdf" target="_blank">Paper</a> |
🌐 <a href="https://huggingface.co/datasets/GAIR/LIMR" target="_blank">Dataset</a> |
📘 <a href="https://huggingface.co/GAIR/LIMR" target="_blank">Model</a>
</p>
## Releases
[2025/02/17] We're releasing the following components:
- 🛠️ **LIM Tools**: Implementation of our **Learning Impact Measurement** methodology
- 🚀 **Training & Evaluation**: Complete implementation of our training pipeline and evaluation scripts
- 🔥 **[LIMR Dataset](https://huggingface.co/datasets/GAIR/LIMR)**: Our curated dataset of 1,389 mathematical questions
- 🤖 **[LIMR Model](https://huggingface.co/GAIR/LIMR)**: Model trained on the LIMR dataset.
## Overview
This repository presents **LIMR**, an approach that challenges the assumption that more training data always helps in reinforcement learning (RL) for LLMs. We demonstrate that the quality and relevance of training samples matter far more than their quantity. Our **Learning Impact Measurement (LIM)** methodology automatically evaluates how effective each training sample is, which removes the need for manual curation while achieving **comparable or superior** results with roughly **6x less** data (1,389 versus 8,523 questions). All our experiments start directly from base models without distillation, which gives a clear view of the core dynamics of RL training.
Our key findings about RL training dynamics:
- A strategically selected subset of training samples (1,389) can achieve comparable or even superior performance compared to training with the full dataset (8,523), fundamentally challenging the assumption that larger datasets necessarily lead to better performance.
- We introduce Learning Impact Measurement (LIM), an automated quantitative method for probing the potential value of RL training samples, enabling systematic analysis of how different samples contribute to model improvement.
- While distilled long-form reasoning data has shown efficiency in larger models, at the scale of ~1K samples with small models (7B), our data-efficient RL approach significantly outperforms SFT with distilled data.
- The path to better reasoning capabilities may not lie in simply scaling up RL training data, but rather in being more selective about which samples to use.
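This README does not spell out how LIM scores a sample. As a rough sketch only, one plausible scheme scores each sample by how closely its per-epoch reward curve tracks the average reward curve over all samples, then keeps the samples whose score exceeds a threshold. The scoring formula, the `0.6` threshold, and the names `alignment_scores` and `select_samples` are assumptions made for illustration; the released LIM Tools remain the authoritative implementation.

```python
from statistics import mean


def alignment_scores(rewards):
    """Score each sample by how well its reward curve tracks the average curve.

    rewards[i][k] is the mean reward of sample i at epoch k, in [0, 1].
    ASSUMPTION: this formula is illustrative, not taken from the README.
    """
    n_epochs = len(rewards[0])
    # Reference curve: mean reward over all samples at each epoch.
    avg = [mean(r[k] for r in rewards) for k in range(n_epochs)]
    # Normalizer: squared gap between the reference curve and the max reward 1.
    denom = sum((1.0 - a) ** 2 for a in avg)
    scores = []
    for r in rewards:
        # Squared distance between this sample's curve and the reference curve.
        num = sum((r[k] - avg[k]) ** 2 for k in range(n_epochs))
        scores.append(1.0 - num / denom if denom > 0 else 0.0)
    return scores


def select_samples(rewards, threshold=0.6):
    """Return indices of samples whose score exceeds the (assumed) threshold."""
    scores = alignment_scores(rewards)
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy reward curves over three epochs (made-up numbers, not real training logs).
toy_rewards = [
    [0.0, 0.5, 1.0],  # improves along with the model
    [0.0, 0.5, 1.0],  # improves along with the model
    [1.0, 1.0, 1.0],  # already solved from the start
    [0.0, 0.0, 0.0],  # never solved
]
selected = select_samples(toy_rewards)
```

In this toy run the two samples that improve in step with the average curve are kept, while the always-solved and never-solved samples are dropped, which matches the intuition that neither trivial nor hopeless questions teach the model much.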
Performance across challenging mathematical benchmarks:
| Method | #Questions | AIME2024 | MATH500 | AMC2023 | AVG. |
|--------|------------|-----------|----------|-----------|-------|
| Qwen-Math-7B | - | 16.7 | 52.4 | 52.5 | 40.5 |
| Qwen-Math-7B-FULL | 8,523 | 32.5 | 76.6 | 61.9 | 57.0 |
| Qwen-Math-7B-RAND | 1,389 | 25.8 | 66.0 | 56.3 | 49.4 |
| Qwen-Math-7B-LINEAR | 1,138 | 28.3 | 74.6 | 61.9 | 54.9 |
| LIMR | 1,389 | **32.5** | **78.0** | **63.8** | **58.1** |
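As a quick sanity check, the AVG. column and the "6x less data" figure can be recomputed from the numbers in the table. The snippet below is a minimal sketch; the names `results`, `average`, and `reduction` are illustrative and not part of the released code.

```python
# Scores copied from the table above, in the order (AIME2024, MATH500, AMC2023).
results = {
    "Qwen-Math-7B": (16.7, 52.4, 52.5),
    "Qwen-Math-7B-FULL": (32.5, 76.6, 61.9),
    "Qwen-Math-7B-RAND": (25.8, 66.0, 56.3),
    "Qwen-Math-7B-LINEAR": (28.3, 74.6, 61.9),
    "LIMR": (32.5, 78.0, 63.8),
}


def average(scores):
    """Unweighted mean over the three benchmarks, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


averages = {name: average(scores) for name, scores in results.items()}

# Ratio of the full training set size to the LIMR subset size.
reduction = 8523 / 1389

for name, avg in averages.items():
    print(f"{name}: {avg}")
print(f"data reduction: {reduction:.2f}x")
```

The recomputed means agree with the AVG. column, and the ratio of 8,523 to 1,389 questions comes out at about 6.14.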
Comparison with other popular RL recipes. We apply RL directly from the base model, without using distilled long chain-of-thought data from larger or stronger models, and use only 1,389 questions.
| Methods | Init Model | Long CoT Distillation | #Questions |
|-----------|------------|---------------|------------|
| STILL-3 | Instruct | Yes | 29,925 |
| DeepScaleR | Instruct | Yes | 40,314 |
| Sky-T1 | Instruct | Yes | 45,000 |
| THUDM-T1 | Instruct | No | 30,000 |
| PRIME | Instruct | No | 150,000 |
| SimpleRL | Base | No | 8,523 |
| LIMR | Base | No | 1,389 |
## Acknowledgements
Our work builds upon the insightful technical reports from the [DeepSeek R1](https://github.com/deepseek-ai/DeepSeek-R1) and [Kimi-k1.5](https://github.com/MoonshotAI/Kimi-k1.5) teams. We thank the [Qwen-Math](https://github.com/QwenLM/Qwen2.5-Math) team for their open-source model, and the creators of [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [vLLM](https://github.com/vllm-project/vllm) for providing, respectively, the reinforcement learning framework and the inference infrastructure that enabled this research.
## Citation
If you find this work useful, please cite our paper:
```bibtex
@misc{limr2025,
author = {Li, Xuefeng and Zou, Haoyang and Liu, Pengfei},
title = {LIMR: Less is More for RL Scaling},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/GAIR-NLP/LIMR}},
}
```
Provider: maas
Created: 2025-02-18



