guru_RL_verl

Name: guru_RL_verl
Creator: maas
Published: 2025-12-05 16:37:27
License: 暂无描述

魔搭社区2025-12-05 更新2025-06-21 收录

下载链接：

https://modelscope.cn/datasets/LLM360/guru_RL_verl

下载链接

链接失效反馈

官方服务：

资源简介：

# Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective ## Note for this extra-info-compressed data version! The dataset provided in this repository is specifically intended for use with the latest release of VeRL ([v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0)). Since VeRL `rl_dataset.py` processes datasets as datasets.Dataset, it is essential that **the structure of all Parquet files remains fully consistent.** This repository is designed to meet that requirement. In this repo, the structure of all Parquet files across diverse tasks has been unified by nesting all task-specific keys under the `extra_info` field. Additionally, both the `extra_info` and `reward_model` fields store compressed JSON-formatted strings to ensure the entire dataset can be efficiently stored within Parquet files. The practioner's guide to use guru dataset is: 1. If you use [Reasoning360 repo](https://github.com/LLM360/Reasoning360) (a fork of VeRL) directly, use [guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k). 2. If you use the official [VeRL](https://github.com/volcengine/verl?tab=readme-ov-file), use this [guru-RL-92k-extra-info-compressed](https://huggingface.co/datasets/LLM360/guru-RL-92k-extra-info-compressed). The reward computations(provided by [llm-reasoner](https://github.com/maitrix-org/llm-reasoners)) involve decompression and deserialization of compressed info, making it slightly slower than in the original Guru dataset. ## Dataset Description **Guru** is a curated six-domain dataset for training large language models (LLM) for complex reasoning with reinforcement learning (RL). The dataset contains 91.9K high-quality samples spanning six diverse reasoning-intensive domains, processed through a comprehensive five-stage curation pipeline to ensure both domain diversity and reward verifiability. ### Dataset Summary Guru addresses the critical need for robust cross-domain reasoning capabilities in LLMs by providing a carefully balanced collection of problems across **math, coding, science, logic, simulation, and tabular reasoning**. Each sample has been filtered for quality and equipped with automated verification mechanisms, making it ideal for RL applications. ### Key Features - **Cross-Domain Coverage**: Six reasoning domains for LLM reasoning research and skill development - **Quality Assurance**: Five-stage curation pipeline with deduplication and heuristic filtering - **RL-Ready**: Domain-specific reward functions for reliable evaluation - **Difficulty Calibration**: Samples filtered to maintain appropriate challenge levels ### Data Structure The dataset is stored in Parquet format for efficient access and processing. Each sample contains at least the following fields: 1. **data_source** - Type: String - Description: Identifier indicating the origin dataset and domain for mapping specific reward functions 2. **prompt** - Type: List of message objects - Contains: - content: The actual text content - role: Message role (e.g., "user") 3. **ability** - Type: String - Description: The primary reasoning skill tested 4. **apply_chat_template** - Type: Boolean - Description: Flag for chat formatting 5. **qwen2.5_7b_pass_rate** - Type: Float - Description: Pass rate with Qwen 2.5-7B model 6. **qwen3_30b_pass_rate** - Type: Float - Description: Pass rate with Qwen 3-30B model 7. **extra_info** - Type: Dictionary - Description: Supplementary information for reward computing - Note: Detailed structures vary from tasks 8. **reward_model** - Type: Dictionary - Contains: - ground_truth: Compressed answer/verification data - Note: Detailed structures vary from tasks ### Domains and Statistics | Domain | Datasets Included | Final Sample Count | Key Focus Areas | |--------|------------------|-------------------|-----------------| | **Math** | OR1, DAPO, DeepScaler | 54.4K | Competition problems, symbolic reasoning | | **Code** | LeetCode, TACO-Verified, PrimeIntellect, LiveCodeBench | 18.1K | Programming challenges, algorithm design | | **Science** | WebInstruct-Verified | 3.6K | University/PhD-level physics, chemistry, biology | | **Logic** | ARC-AGI, BARC, Custom puzzles | 6.3K | Symbolic reasoning, constraint satisfaction | | **Simulation** | Code I/O (PyEdu) | 3.7K | Code behavior prediction without execution | | **Table** | HiTab, MultiHierTT | 6.1K | Single and multi-table reasoning | **Total Samples**: 91.9K (filtered from 684.3K raw samples) ### Dataset Sources | Domain | Dataset | Source | |--------|---------|--------| | **Math** | OR1 | [Skywork-OR1 (2025)](https://github.com/SkyworkAI/Skywork-O1-Open) | | | DAPO | [DAPO Dataset](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) | | | DeepScaler | [DeepScaleR Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) | | **Code** | LeetCode | [LeetCode Dataset](https://huggingface.co/datasets/greengerong/leetcode) | | | TACO-Verified | [TACO Dataset](https://huggingface.co/datasets/BAAI/TACO) | | | PrimeIntellect | [PrimeIntellect Dataset](https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1) | | | LiveCodeBench (history) | [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) | | **Science** | WebInstruct-Verified | [WebInstruct Dataset](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified) | | **Logic** | Zebra Puzzle | - | | | Ordering Puzzle | - | | | Graph Puzzle | - | | | ARC-AGI-1/2 | [ARC-AGI Dataset](https://arcprize.org/arc-agi) | | | BARC | [BARC Dataset](https://huggingface.co/barc0) | | **Simulation** | Code I/O (PyEdu) | [CodeIO-PyEdu Dataset](https://huggingface.co/datasets/hkust-nlp/CodeIO-PyEdu-Reasoning) | | **Table** | HiTab | [HiTab Dataset](https://github.com/microsoft/HiTab) | | | MultiHierTT | [MultiHierTT Dataset](https://github.com/psunlpgroup/MultiHiertt) | ## Citation If you find this dataset helpful in your research, please consider citing: ```bibtex @misc{cheng2025revisiting, title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective}, author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu}, journal = {arXiv preprint arXiv:2506.14965}, year = {2025}, doi = {10.48550/arXiv.2506.14965}, url = {https://arxiv.org/abs/2506.14965} } ``` *This dataset card follows the Hugging Face dataset card template and provides comprehensive information about the Guru dataset structure, creation process, and intended use cases.*

# 从跨域视角重新审视用于大语言模型推理的强化学习 ## 本压缩额外信息版本数据集说明！本仓库提供的数据集专为最新版VeRL（[v0.4.0](https://github.com/volcengine/verl/releases/tag/v0.4.0)）设计。由于VeRL的`rl_dataset.py`以`datasets.Dataset`格式处理数据集，因此**所有Parquet文件的结构必须完全保持一致**。本仓库正是为满足这一要求而打造。在本仓库中，我们将所有任务专属键嵌套在`extra_info`字段下，统一了跨不同任务的所有Parquet文件结构。此外，`extra_info`与`reward_model`字段均存储压缩后的JSON格式字符串，以确保整个数据集可高效存储于Parquet文件中。 Guru数据集的使用指南如下： 1. 若直接使用[Reasoning360仓库](https://github.com/LLM360/Reasoning360)（VeRL的一个分支），请使用[guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k)数据集。 2. 若使用官方[VeRL](https://github.com/volcengine/verl?tab=readme-ov-file)，请使用本[guru-RL-92k-extra-info-compressed](https://huggingface.co/datasets/LLM360/guru-RL-92k-extra-info-compressed)数据集。奖励计算（由[llm-reasoner](https://github.com/maitrix-org/llm-reasoners)提供）需要对压缩信息进行解压与反序列化，因此相较于原始Guru数据集，其运行速度略慢。 ## 数据集描述 **Guru**是一款精选的六域数据集，用于通过强化学习（Reinforcement Learning，RL）训练大语言模型（Large Language Model，LLM）完成复杂推理任务。该数据集包含91.9K条高质量样本，覆盖六个多样化的推理密集型领域，通过一套完整的五阶段整理流水线进行处理，以确保领域多样性与奖励可验证性。 ### 数据集概览 Guru旨在满足大语言模型对鲁棒跨域推理能力的迫切需求，提供了经过精心平衡的问题集合，覆盖**数学、代码、科学、逻辑、仿真与表格推理**六大领域。每条样本均经过质量过滤，并配备了自动化验证机制，非常适合强化学习应用场景。 ### 核心特性 - **跨域覆盖**：涵盖六大推理领域，支持大语言模型推理研究与能力提升 - **质量保障**：采用五阶段整理流水线，包含去重与启发式过滤环节 - **适配强化学习**：配备领域专属奖励函数，可实现可靠的模型评估 - **难度校准**：对样本进行过滤以维持合适的挑战难度 ### 数据结构该数据集以Parquet格式存储，以实现高效的访问与处理。每条样本至少包含以下字段： 1. **data_source** - 类型：字符串 - 描述：用于标识数据集来源与领域的标识符，以便匹配对应的专属奖励函数 2. **prompt** - 类型：消息对象列表 - 包含内容： - content：实际文本内容 - role：消息角色（例如"user"） 3. **ability** - 类型：字符串 - 描述：测试的核心推理技能 4. **apply_chat_template** - 类型：布尔值 - 描述：用于标识是否需要应用对话模板的标记 5. **qwen2.5_7b_pass_rate** - 类型：浮点数 - 描述：使用Qwen 2.5-7B模型时的通过率 6. **qwen3_30b_pass_rate** - 类型：浮点数 - 描述：使用Qwen 3-30B模型时的通过率 7. **extra_info** - 类型：字典 - 描述：用于奖励计算的补充信息 - 备注：具体结构因任务而异 8. **reward_model** - 类型：字典 - 包含内容： - ground_truth：压缩后的答案/验证数据 - 备注：具体结构因任务而异 ### 领域与统计数据 | 领域 | 包含的数据集 | 最终样本数 | 核心聚焦领域 | |------------|----------------------------------|------------|----------------------------------| | **数学** | OR1、DAPO、DeepScaler | 54.4K | 竞赛题、符号推理 | | **代码** | LeetCode、TACO-Verified、PrimeIntellect、LiveCodeBench | 18.1K | 编程挑战、算法设计 | | **科学** | WebInstruct-Verified | 3.6K | 大学/博士级物理、化学、生物 | | **逻辑** | ARC-AGI、BARC、自定义谜题 | 6.3K | 符号推理、约束满足 | | **仿真** | Code I/O（PyEdu） | 3.7K | 无需执行的代码行为预测 | | **表格** | HiTab、MultiHierTT | 6.1K | 单表与多表推理 | **总样本数**：91.9K（从684.3K原始样本中过滤得到） ### 数据集来源 | 领域 | 数据集 | 来源链接 | |------------|----------------------|--------------------------------------------------------------------------| | **数学** | OR1 | [Skywork-OR1 (2025)](https://github.com/SkyworkAI/Skywork-O1-Open) | | | DAPO | [DAPO数据集](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) | | | DeepScaler | [DeepScaleR数据集](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) | | **代码** | LeetCode | [LeetCode数据集](https://huggingface.co/datasets/greengerong/leetcode) | | | TACO-Verified | [TACO数据集](https://huggingface.co/datasets/BAAI/TACO) | | | PrimeIntellect | [PrimeIntellect数据集](https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1) | | | LiveCodeBench（历史版本） | [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) | | **科学** | WebInstruct-Verified | [WebInstruct数据集](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified) | | **逻辑** | 斑马谜题 | - | | | 排序谜题 | - | | | 图谜题 | - | | | ARC-AGI-1/2 | [ARC-AGI数据集](https://arcprize.org/arc-agi) | | | BARC | [BARC数据集](https://huggingface.co/barc0) | | **仿真** | Code I/O（PyEdu） | [CodeIO-PyEdu数据集](https://huggingface.co/datasets/hkust-nlp/CodeIO-PyEdu-Reasoning) | | **表格** | HiTab | [HiTab数据集](https://github.com/microsoft/HiTab) | | | MultiHierTT | [MultiHierTT数据集](https://github.com/psunlpgroup/MultiHiertt) | ## 引用若您的研究中使用了本数据集，请考虑引用以下文献： bibtex @misc{cheng2025revisiting, title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective}, author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu}, journal = {arXiv preprint arXiv:2506.14965}, year = {2025}, doi = {10.48550/arXiv.2506.14965}, url = {https://arxiv.org/abs/2506.14965} } *本数据集卡片遵循Hugging Face数据集卡片模板，提供了关于Guru数据集结构、创建流程与预期使用场景的全面信息。*

提供机构：

maas

创建时间：

2025-06-05

搜集汇总

数据集介绍