SimpleRL-verl-modeleval
ModelScope community, updated 2025-08-05, indexed 2025-08-09
Download link: https://modelscope.cn/datasets/bert123/SimpleRL-verl-modeleval
Resource description:
<h1 style="text-align: center;">verl: Volcano Engine Reinforcement Learning for LLM</h1>
verl is a flexible, efficient and production-ready RL training library for large language models (LLMs).
verl is the open-source implementation of the paper **[HybridFlow: A Flexible and Efficient RLHF Framework](https://arxiv.org/abs/2409.19256v2)**.
verl is flexible and easy to use with:
- **Easy extension of diverse RL algorithms**: The hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex post-training dataflows, allowing users to build RL dataflows in a few lines of code.
- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.
- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.
- Ready integration with popular Hugging Face models.
verl is fast with:
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, verl achieves high generation and training throughput.
- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
<p align="center">
| <a href="https://verl.readthedocs.io/en/latest/index.html"><b>Documentation</b></a> | <a href="https://arxiv.org/abs/2409.19256v2"><b>Paper</b></a> | <a href="https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA"><b>Slack</b></a> | <a href="https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG"><b>Wechat</b></a> | <a href="https://x.com/verl_project"><b>Twitter</b></a>
<!-- <a href=""><b>Slides</b></a> | -->
</p>
## News
- [2025/3] We will present verl (HybridFlow) at [EuroSys 2025](https://2025.eurosys.org/). See you in Rotterdam!
- [2025/2] verl v0.2.0.post1 is released! See [release note](https://github.com/volcengine/verl/releases/) for details.
- [2025/2] We presented verl in the [Bytedance/NVIDIA/Anyscale Ray Meetup](https://lu.ma/ji7atxux). See you in San Jose!
- [2025/1] [Doubao-1.5-pro](https://team.doubao.com/zh/special/doubao_1_5_pro) is released with SOTA-level performance on LLM & VLM. The RL scaling preview model is trained using verl, reaching OpenAI O1-level performance on math benchmarks (70.0 pass@1 on AIME).
- [2024/12] The team presented <a href="https://neurips.cc/Expo/Conferences/2024/workshop/100677">Post-training LLMs: From Algorithms to Infrastructure</a> at NeurIPS 2024. [Slides](https://github.com/eric-haibin-lin/verl-data/tree/neurips) and [video](https://neurips.cc/Expo/Conferences/2024/workshop/100677) available.
- [2024/12] verl is presented at Ray Forward 2024. Slides available [here](https://github.com/eric-haibin-lin/verl-community/blob/main/slides/Ray_Forward_2024_%E5%B7%AB%E9%94%A1%E6%96%8C.pdf).
- [2024/10] verl is presented at Ray Summit. [Youtube video](https://www.youtube.com/watch?v=MrhMcXkXvJU&list=PLzTswPQNepXntmT8jr9WaNfqQ60QwW7-U&index=37) available.
- [2024/08] HybridFlow (verl) is accepted to EuroSys 2025.
## Key Features
- **FSDP** and **Megatron-LM** for training.
- **vLLM** and **TGI** for rollout generation, **SGLang** support coming soon.
- Hugging Face model support
- Supervised fine-tuning
- Reinforcement learning from human feedback with [PPO](https://github.com/volcengine/verl/tree/main/examples/ppo_trainer), [GRPO](https://github.com/volcengine/verl/tree/main/examples/grpo_trainer), [ReMax](https://github.com/volcengine/verl/tree/main/examples/remax_trainer), [Reinforce++](https://verl.readthedocs.io/en/latest/examples/config.html#algorithm), [RLOO](https://github.com/volcengine/verl/tree/main/examples/rloo_trainer/run_qwen2-7b.sh), etc
- Support model-based reward and function-based reward (verifiable reward)
- flash-attention, [sequence packing](examples/ppo_trainer/run_qwen2-7b_seq_balance.sh), [long context](examples/ppo_trainer/run_deepseek7b_llm_sp2.sh) support via DeepSpeed Ulysses, [LoRA](examples/sft/gsm8k/run_qwen_05_peft.sh), [Liger-kernel](examples/sft/gsm8k/run_qwen_05_sp2_liger.sh)
- scales up to 70B models and hundreds of GPUs
- experiment tracking with wandb, swanlab and mlflow
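
To make the "function-based reward (verifiable reward)" item above concrete, here is a minimal sketch of a rule-based reward for math problems. It assumes responses end with a GSM8K-style `#### <number>` marker. The names `extract_final_answer` and `verifiable_reward` are hypothetical and chosen for illustration; they are not verl's API, and the actual interface is described in the reward function documentation linked under Getting Started.

```python
import re
from typing import Optional

# Assumption: the final answer is written as "#### <number>" (GSM8K convention).
_ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(text: str) -> Optional[str]:
    """Return the last '#### <number>' answer in text, without thousands separators."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward: 1.0 on an exact answer match, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0


if __name__ == "__main__":
    print(verifiable_reward("16 + 2 = 18\n#### 18", "18"))
    print(verifiable_reward("I am not sure.", "18"))
```

A reward like this needs no learned reward model: the score comes from checking the response against a known reference, which is what makes it verifiable.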
## Upcoming Features
- Reward model training
- DPO training
- DeepSeek integration with Megatron v0.11
- SGLang integration
- vision language model RL
## Getting Started
**Quickstart:**
- [Installation](https://verl.readthedocs.io/en/latest/start/install.html)
- [Quickstart](https://verl.readthedocs.io/en/latest/start/quickstart.html)
- [Programming Guide](https://verl.readthedocs.io/en/latest/hybrid_flow.html)
**Running a PPO example step-by-step:**
- Data and Reward Preparation
  - [Prepare Data for Post-Training](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)
  - [Implement Reward Function for Dataset](https://verl.readthedocs.io/en/latest/preparation/reward_function.html)
- Understanding the PPO Example
  - [PPO Example Architecture](https://verl.readthedocs.io/en/latest/examples/ppo_code_architecture.html)
  - [Config Explanation](https://verl.readthedocs.io/en/latest/examples/config.html)
  - [Run GSM8K Example](https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html)
**Reproducible algorithm baselines:**
- [PPO, GRPO, ReMax](https://verl.readthedocs.io/en/latest/experiment/ppo.html)
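
As a rough illustration of what distinguishes GRPO from PPO among these baselines: GRPO replaces the learned critic with a group-relative baseline, normalizing each response's reward against the other responses sampled for the same prompt. The sketch below shows only that normalization step. The function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration, not verl's implementation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    real implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored by a 0/1 verifiable reward.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses scoring above the group mean get positive advantages and those below get negative ones. When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.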
**For code explanation and advanced usage (extension):**
- PPO Trainer and Workers
  - [PPO Ray Trainer](https://verl.readthedocs.io/en/latest/workers/ray_trainer.html)
  - [PyTorch FSDP Backend](https://verl.readthedocs.io/en/latest/workers/fsdp_workers.html)
  - [Megatron-LM Backend](https://verl.readthedocs.io/en/latest/index.html)
- Advanced Usage and Extension
  - [Ray API design tutorial](https://verl.readthedocs.io/en/latest/advance/placement.html)
  - [Extend to Other RL(HF) algorithms](https://verl.readthedocs.io/en/latest/advance/dpo_extension.html)
  - [Add Models with the FSDP Backend](https://verl.readthedocs.io/en/latest/advance/fsdp_extension.html)
  - [Add Models with the Megatron-LM Backend](https://verl.readthedocs.io/en/latest/advance/megatron_extension.html)
  - [Deployment using Separate GPU Resources](https://github.com/volcengine/verl/tree/main/examples/split_placement)
**Blogs from the community**
- [使用verl进行GRPO分布式强化学习训练最佳实践](https://www.volcengine.com/docs/6459/1463942)
- [HybridFlow veRL 原文浅析](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/readme.md)
- [最高提升20倍吞吐量!豆包大模型团队发布全新 RLHF 框架,现已开源!](https://team.doubao.com/en/blog/%E6%9C%80%E9%AB%98%E6%8F%90%E5%8D%8720%E5%80%8D%E5%90%9E%E5%90%90%E9%87%8F-%E8%B1%86%E5%8C%85%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%9B%A2%E9%98%9F%E5%8F%91%E5%B8%83%E5%85%A8%E6%96%B0-rlhf-%E6%A1%86%E6%9E%B6-%E7%8E%B0%E5%B7%B2%E5%BC%80%E6%BA%90)
Check out this [Jupyter Notebook](https://github.com/volcengine/verl/tree/main/examples/ppo_trainer/verl_getting_started.ipynb) to get started with PPO training on a single 24GB L4 GPU (**FREE** GPU quota provided by [Lightning Studio](https://lightning.ai/hlin-verl/studios/verl-getting-started))!
## Performance Tuning Guide
Performance is essential for on-policy RL algorithms. We have written a detailed performance tuning guide to help users tune performance. See [here](https://verl.readthedocs.io/en/latest/perf/perf_tuning.html) for more details.
## vLLM v0.7 integration preview
We have released a testing version of verl that supports vLLM>=0.7.0. Please refer to [this document](https://github.com/volcengine/verl/blob/main/docs/README_vllm0.7.md) for the installation guide and more information.
## Citation and acknowledgement
If you find the project helpful, please cite:
- [HybridFlow: A Flexible and Efficient RLHF Framework](https://arxiv.org/abs/2409.19256v2)
- [A Framework for Training Large Language Models for Code Generation via Proximal Policy Optimization](https://i.cs.hku.hk/~cwu/papers/gmsheng-NL2Code24.pdf)
```tex
@article{sheng2024hybridflow,
title = {HybridFlow: A Flexible and Efficient RLHF Framework},
author = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},
year = {2024},
journal = {arXiv preprint arXiv: 2409.19256}
}
```
verl is inspired by the design of NeMo-Aligner, DeepSpeed-Chat and OpenRLHF. The project is adopted and supported by Anyscale, Bytedance, LMSys.org, Shanghai AI Lab, Tsinghua University, UC Berkeley, UCLA, UIUC, University of Hong Kong, and many more.
## Awesome work using verl
- [TinyZero](https://github.com/Jiayi-Pan/TinyZero): a reproduction of **DeepSeek R1 Zero** recipe for reasoning tasks
- [PRIME](https://github.com/PRIME-RL/PRIME): Process reinforcement through implicit rewards
- [RAGEN](https://github.com/ZihanWang314/ragen): a general-purpose reasoning **agent** training framework
- [Logic-RL](https://github.com/Unakar/Logic-RL): a reproduction of DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset.
- [deepscaler](https://github.com/agentica-project/deepscaler): iterative context scaling with GRPO
- [critic-rl](https://github.com/HKUNLP/critic-rl): LLM critics for code generation
- [Easy-R1](https://github.com/hiyouga/EasyR1): **Multi-modal** RL training framework
- [self-rewarding-reasoning-LLM](https://arxiv.org/pdf/2502.19613): self-rewarding and correction with **generative reward models**
- [Search-R1](https://github.com/PeterGriffinJin/Search-R1): RL with reasoning and **searching (tool-call)** interleaved LLMs
- [Code-R1](https://github.com/ganler/code-r1): Reproducing R1 for **Code** with Reliable Rewards
- [DQO](https://arxiv.org/abs/2410.09302): Enhancing multi-Step reasoning abilities of language models through direct Q-function optimization
- [FIRE](https://arxiv.org/abs/2410.21236): Flaming-hot initiation with regular execution sampling for large language models
## Contribution Guide
Contributions from the community are welcome! Please check out our [roadmap](https://github.com/volcengine/verl/issues/22) and [release plan](https://github.com/volcengine/verl/issues/354).
### Code formatting
We use yapf (Google style) to enforce strict code formatting when reviewing PRs. To reformat your code locally, make sure you have installed the **latest** `yapf`:
```bash
pip3 install yapf --upgrade
```
Then, make sure you are at the top level of the verl repo and run:
```bash
bash scripts/format.sh
```
We are HIRING! Send us an [email](mailto:haibin.lin@bytedance.com) if you are interested in internship/FTE opportunities in MLSys/LLM reasoning/multimodal alignment.
Provider: maas
Created: 2025-07-29



