RLEP_dataset
Source: ModelScope community (updated 2025-12-05, indexed 2025-09-13)
Download link: https://modelscope.cn/datasets/Kwai-Klear/RLEP_dataset
Description:
This repository contains the datasets used in the paper [RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning](https://arxiv.org/abs/2507.07451).
RLEP (Reinforcement Learning with Experience rePlay) is a two-phase framework that first collects verified successful trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini‑batches that blend newly generated rollouts with these replayed successes. By replaying high‑quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance on math reasoning tasks.
Code: https://github.com/Kwai-Klear/RLEP
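To make the replay step concrete, here is a minimal Python sketch of how the training group for one question could combine fresh rollouts with replayed successes. It is an illustration under stated assumptions, not the code in the repository above: the function name `blend_rollouts`, the `num_replay` parameter, and uniform sampling from the pool are hypothetical choices.

```python
import random


def blend_rollouts(fresh_rollouts, experience_pool, num_replay, rng):
    """Combine newly generated rollouts with replayed successful trajectories.

    fresh_rollouts:  rollouts just sampled from the current policy.
    experience_pool: verified successful trajectories collected earlier.
    num_replay:      how many successes to replay (hypothetical parameter).
    rng:             a random.Random instance, passed in for reproducibility.
    """
    # Never request more replayed samples than the pool holds.
    k = min(num_replay, len(experience_pool))
    # Uniform sampling without replacement is an assumption of this sketch.
    replayed = rng.sample(experience_pool, k)
    return list(fresh_rollouts) + replayed


rng = random.Random(0)
group = blend_rollouts(
    fresh_rollouts=["fresh_1", "fresh_2"],
    experience_pool=["success_a", "success_b", "success_c"],
    num_replay=2,
    rng=rng,
)
print(len(group))  # two fresh rollouts plus two replayed successes
```

The blended group would then be scored and used for the policy update in the same way as a group made only of fresh rollouts.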
* The test Parquet file `dapo_format_aime2024_aime2025_amc2023.parquet` contains the AIME‑2024, AIME‑2025, and AMC‑2023 datasets. The AIME‑2024 portion is the official DAPO test set ([`aime-2024.parquet`](https://github.com/BytedTsinghua-SIA/DAPO/blob/main/eval/aime-2024.parquet)). We have appended the AIME‑2025 and AMC‑2023 splits to the same file, following the exact DAPO schema.
* The training Parquet file `dapo-math-17k-with-experience-pool.parquet` follows the same schema as [`dapo-math-17k.parquet`](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k/blob/main/data/dapo-math-17k.parquet). The collected experience pool is stored in the `reward_model.candidates` field. Questions with fewer than two successful trajectories were removed, leaving 14k distinct questions in total.
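The filtering rule described above can be sketched in Python as follows. This is a hedged illustration, not the authors' preprocessing script: it assumes each row is a dict whose `reward_model` entry is itself a dict holding a `candidates` sequence, and the helper names `count_candidates` and `keep_replayable` are hypothetical.

```python
def count_candidates(row):
    """Return how many successful trajectories a row carries.

    Assumes row["reward_model"]["candidates"] is a sequence; a missing
    field is treated as an empty experience pool.
    """
    reward_model = row.get("reward_model") or {}
    candidates = reward_model.get("candidates")
    return 0 if candidates is None else len(candidates)


def keep_replayable(rows, min_successes=2):
    """Keep only rows with at least `min_successes` successful trajectories."""
    return [row for row in rows if count_candidates(row) >= min_successes]


# Toy rows that mimic the assumed layout; the contents are made up.
rows = [
    {"prompt": "q1", "reward_model": {"ground_truth": "4", "candidates": ["a", "b"]}},
    {"prompt": "q2", "reward_model": {"ground_truth": "7", "candidates": ["a"]}},
    {"prompt": "q3", "reward_model": {"ground_truth": "9"}},
]
kept = keep_replayable(rows)
print([row["prompt"] for row in kept])
```

If the Parquet file is loaded with a library such as pandas, the nested field may come back in a different container type, so the accessors may need adapting.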
## Sample Usage
You can download the dataset using `git lfs` and concatenate the parts for the training data:
```bash
git lfs install
git clone https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset
cd RLEP_dataset
# concatenate the pieces in order
cat dapo-math-17k-with-experience-pool.parquet.part-* \
> dapo-math-17k-with-experience-pool.parquet
```
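Where `cat` is not available (for example on Windows), the same concatenation can be sketched in Python. The helper names `concatenate_parts` and `has_parquet_magic` are hypothetical. The sketch assumes the part suffixes sort lexicographically in the intended order, which is also what the shell glob above relies on. The sanity check uses the fact that an unencrypted Parquet file begins and ends with the 4-byte marker `PAR1`.

```python
import shutil
from pathlib import Path


def concatenate_parts(directory, output_name):
    """Join `<output_name>.part-*` files in sorted order into `output_name`."""
    directory = Path(directory)
    parts = sorted(directory.glob(output_name + ".part-*"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {output_name} in {directory}")
    output_path = directory / output_name
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the bytes so large parts are never fully held in memory.
                shutil.copyfileobj(src, out)
    return output_path


def has_parquet_magic(path):
    """Cheap check: the file starts and ends with the b"PAR1" marker."""
    path = Path(path)
    if path.stat().st_size < 8:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # move to 4 bytes before the end of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

A call such as `concatenate_parts(".", "dapo-math-17k-with-experience-pool.parquet")` from inside the cloned directory, followed by `has_parquet_magic(...)` on the result, gives a quick signal that no part was missing or out of order.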
## Citation
If you find our paper or code helpful, we would appreciate it if you could cite our work:
```
@misc{zhang2025rlepreinforcementlearningexperience,
title={RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning},
author={Hongzhi Zhang and Jia Fu and Jingyuan Zhang and Kai Fu and Qi Wang and Fuzheng Zhang and Guorui Zhou},
year={2025},
eprint={2507.07451},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.07451},
}
```
## Acknowledgement
We conducted our experiments with the [VERL](https://github.com/volcengine/verl) framework and the [Qwen2.5‑Math‑7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model, using the dataset and training scripts provided by [DAPO](https://dapo-sia.github.io/).
Many thanks to these open‑source projects and the broader community for making these resources available!
Provider: maas
Created: 2025-09-06



