guru-RL-92k
收藏魔搭社区2026-01-06 更新2025-06-21 收录
下载链接:
https://modelscope.cn/datasets/LLM360/guru-RL-92k
下载链接
链接失效反馈官方服务:
资源简介:
# Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
## Dataset Description
**Guru** is a curated six-domain dataset for training large language models (LLM) for complex reasoning with reinforcement learning (RL). The dataset contains 91.9K high-quality samples spanning six diverse reasoning-intensive domains, processed through a comprehensive five-stage curation pipeline to ensure both domain diversity and reward verifiability.
### Dataset Summary
Guru addresses the critical need for robust cross-domain reasoning capabilities in LLMs by providing a carefully balanced collection of problems across **math, coding, science, logic, simulation, and tabular reasoning**. Each sample has been filtered for quality and equipped with automated verification mechanisms, making it ideal for RL applications.
### Key Features
- **Cross-Domain Coverage**: Six reasoning domains for LLM reasoning research and skill development
- **Quality Assurance**: Five-stage curation pipeline with deduplication and heuristic filtering
- **RL-Ready**: Domain-specific reward functions for reliable evaluation
- **Difficulty Calibration**: Samples filtered to maintain appropriate challenge levels
### Data Structure
The dataset is stored in Parquet format for efficient access and processing. Each sample contains at least the following fields:
1. **data_source**
- Type: String
- Description: Identifier indicating the origin dataset and domain for mapping specific reward functions
2. **prompt**
- Type: List of message objects
- Contains:
- content: The actual text content
- role: Message role (e.g., "user")
3. **ability**
- Type: String
- Description: The primary reasoning skill tested
4. **apply_chat_template**
- Type: Boolean
- Description: Flag for chat formatting
5. **qwen2.5_7b_pass_rate**
- Type: Float
- Description: Pass rate with Qwen 2.5-7B model
6. **qwen3_30b_pass_rate**
- Type: Float
- Description: Pass rate with Qwen 3-30B model
7. **extra_info**
- Type: Dictionary
- Description: Supplementary information for reward computing
- Note: Detailed structures vary from tasks
8. **reward_model**
- Type: Dictionary
- Contains:
- ground_truth: Compressed answer/verification data
- Note: Detailed structures vary from tasks
### Domains and Statistics
| Domain | Datasets Included | Final Sample Count | Key Focus Areas |
|--------|------------------|-------------------|-----------------|
| **Math** | OR1, DAPO, DeepScaler | 54.4K | Competition problems, symbolic reasoning |
| **Code** | LeetCode, TACO-Verified, PrimeIntellect, LiveCodeBench | 18.1K | Programming challenges, algorithm design |
| **Science** | WebInstruct-Verified | 3.6K | University/PhD-level physics, chemistry, biology |
| **Logic** | ARC-AGI, BARC, Custom puzzles | 6.3K | Symbolic reasoning, constraint satisfaction |
| **Simulation** | Code I/O (PyEdu) | 3.7K | Code behavior prediction without execution |
| **Table** | HiTab, MultiHierTT | 6.1K | Single and multi-table reasoning |
**Total Samples**: 91.9K (filtered from 684.3K raw samples)
### Dataset Sources
| Domain | Dataset | Source |
|--------|---------|--------|
| **Math** | OR1 | [Skywork-OR1 (2025)](https://github.com/SkyworkAI/Skywork-O1-Open) |
| | DAPO | [DAPO Dataset](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) |
| | DeepScaler | [DeepScaleR Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) |
| **Code** | LeetCode | [LeetCode Dataset](https://huggingface.co/datasets/greengerong/leetcode) |
| | TACO-Verified | [TACO Dataset](https://huggingface.co/datasets/BAAI/TACO) |
| | PrimeIntellect | [PrimeIntellect Dataset](https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1) |
| | LiveCodeBench (history) | [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) |
| **Science** | WebInstruct-Verified | [WebInstruct Dataset](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified) |
| **Logic** | Zebra Puzzle | - |
| | Ordering Puzzle | - |
| | Graph Puzzle | - |
| | ARC-AGI-1/2 | [ARC-AGI Dataset](https://arcprize.org/arc-agi) |
| | BARC | [BARC Dataset](https://huggingface.co/barc0) |
| **Simulation** | Code I/O (PyEdu) | [CodeIO-PyEdu Dataset](https://huggingface.co/datasets/hkust-nlp/CodeIO-PyEdu-Reasoning) |
| **Table** | HiTab | [HiTab Dataset](https://github.com/microsoft/HiTab) |
| | MultiHierTT | [MultiHierTT Dataset](https://github.com/psunlpgroup/MultiHiertt) |
## Citation
If you find this dataset helpful in your research, please consider citing:
```bibtex
@misc{cheng2025revisiting,
title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective},
author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
journal = {arXiv preprint arXiv:2506.14965},
year = {2025},
doi = {10.48550/arXiv.2506.14965},
url = {https://arxiv.org/abs/2506.14965}
}
```
*This dataset card follows the Hugging Face dataset card template and provides comprehensive information about the Guru dataset structure, creation process, and intended use cases.*
# 从跨域视角重新审视用于大语言模型的强化学习
## 数据集说明
**Guru** 是一套经过精心甄选的六领域数据集,用于通过强化学习(Reinforcement Learning,RL)训练具备复杂推理能力的大语言模型(Large Language Model,LLM)。该数据集包含91.9K条高质量样本,覆盖六个多样化的推理密集型领域,并通过一套完整的五阶段精选流程进行处理,以确保领域多样性与奖励可验证性。
### 数据集概览
Guru 针对大语言模型亟需鲁棒的跨域推理能力这一关键需求,提供了一套经过精心平衡的问题集合,涵盖**数学、编码、科学、逻辑、仿真与表格推理**六大领域。每条样本均经过质量过滤,并配备了自动化验证机制,非常适用于强化学习相关应用。
### 核心特性
- **跨域覆盖**:涵盖六大推理领域,支持大语言模型推理研究与技能培养
- **质量保障**:包含去重与启发式过滤的五阶段精选流程
- **适配强化学习**:配备领域专属奖励函数,可实现可靠评估
- **难度校准**:对样本进行过滤以维持合适的挑战等级
### 数据结构
数据集以Parquet格式存储,以实现高效访问与处理。每条样本至少包含以下字段:
1. **data_source**
- 类型:字符串
- 描述:用于标识原始数据集与所属领域的标识符,以便匹配特定的奖励函数
2. **prompt**
- 类型:消息对象列表
- 包含内容:
- content:实际文本内容
- role:消息角色(例如"user")
3. **ability**
- 类型:字符串
- 描述:所测试的核心推理技能
4. **apply_chat_template**
- 类型:布尔值
- 描述:用于标识是否需要应用聊天模板的标记
5. **qwen2.5_7b_pass_rate**
- 类型:浮点数
- 描述:使用Qwen 2.5-7B模型时的通过率
6. **qwen3_30b_pass_rate**
- 类型:浮点数
- 描述:使用Qwen 3-30B模型时的通过率
7. **extra_info**
- 类型:字典
- 描述:用于奖励计算的补充信息
- 备注:具体结构因任务而异
8. **reward_model**
- 类型:字典
- 包含内容:
- ground_truth:压缩后的答案/验证数据
- 备注:具体结构因任务而异
### 领域与统计信息
| 领域 | 包含的数据集 | 最终样本数 | 核心聚焦方向 |
|--------------|----------------------------|------------|----------------------------------|
| **数学** | OR1、DAPO、DeepScaler | 54.4K | 竞赛试题、符号推理 |
| **编码** | LeetCode、TACO-Verified、PrimeIntellect、LiveCodeBench | 18.1K | 编程挑战、算法设计 |
| **科学** | WebInstruct-Verified | 3.6K | 本科/博士阶段物理、化学、生物 |
| **逻辑** | ARC-AGI、BARC、自定义谜题 | 6.3K | 符号推理、约束满足 |
| **仿真** | Code I/O (PyEdu) | 3.7K | 无需执行的代码行为预测 |
| **表格** | HiTab、MultiHierTT | 6.1K | 单表与多表推理 |
**总样本数**:91.9K(从684.3K原始样本中过滤得到)
### 数据集来源
| 领域 | 数据集名称 | 来源链接 |
|--------------|--------------------------|--------------------------------------------------------------------------|
| **数学** | OR1 | [Skywork-OR1 (2025)](https://github.com/SkyworkAI/Skywork-O1-Open) |
| | DAPO | [DAPO 数据集](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) |
| | DeepScaler | [DeepScaleR 数据集](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) |
| **编码** | LeetCode | [LeetCode 数据集](https://huggingface.co/datasets/greengerong/leetcode) |
| | TACO-Verified | [TACO 数据集](https://huggingface.co/datasets/BAAI/TACO) |
| | PrimeIntellect | [PrimeIntellect 数据集](https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1) |
| | LiveCodeBench (历史版本) | [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) |
| **科学** | WebInstruct-Verified | [WebInstruct 数据集](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified) |
| **逻辑** | Zebra Puzzle | - |
| | Ordering Puzzle | - |
| | Graph Puzzle | - |
| | ARC-AGI-1/2 | [ARC-AGI 数据集](https://arcprize.org/arc-agi) |
| | BARC | [BARC 数据集](https://huggingface.co/barc0) |
| **仿真** | Code I/O (PyEdu) | [CodeIO-PyEdu 数据集](https://huggingface.co/datasets/hkust-nlp/CodeIO-PyEdu-Reasoning) |
| **表格** | HiTab | [HiTab 数据集](https://github.com/microsoft/HiTab) |
| | MultiHierTT | [MultiHierTT 数据集](https://github.com/psunlpgroup/MultiHiertt) |
## 引用说明
若您的研究中使用了本数据集,请引用以下文献:
bibtex
@misc{cheng2025revisiting,
title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective},
author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
journal = {arXiv preprint arXiv:2506.14965},
year = {2025},
doi = {10.48550/arXiv.2506.14965},
url = {https://arxiv.org/abs/2506.14965}
}
*本数据集卡片遵循Hugging Face数据集卡片模板,提供了关于Guru数据集的结构、创建流程与预期使用场景的全面信息。*
提供机构:
maas
创建时间:
2025-06-19
搜集汇总
数据集介绍

背景与挑战
背景概述
Guru是一个专为大型语言模型设计的强化学习数据集,包含91.9K个高质量样本,覆盖数学、编码、科学、逻辑、模拟和表格推理六个领域。这些样本经过五阶段筛选流程,确保领域多样性和奖励可验证性,适用于复杂推理任务的训练。
以上内容由遇见数据集搜集并总结生成



