five

guru_RL

收藏
魔搭社区2025-12-05 更新2025-06-07 收录
下载链接:
https://modelscope.cn/datasets/LLM360/guru_RL
下载链接
链接失效反馈
官方服务:
资源简介:
# Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective ## Dataset Description **Guru** is a curated six-domain dataset for training large language models (LLM) for complex reasoning with reinforcement learning (RL). The dataset contains 91.9K high-quality samples spanning six diverse reasoning-intensive domains, processed through a comprehensive five-stage curation pipeline to ensure both domain diversity and reward verifiability. ### Dataset Summary Guru addresses the critical need for robust cross-domain reasoning capabilities in LLMs by providing a carefully balanced collection of problems across **math, coding, science, logic, simulation, and tabular reasoning**. Each sample has been filtered for quality and equipped with automated verification mechanisms, making it ideal for RL applications. ### Key Features - **Cross-Domain Coverage**: Six reasoning domains for LLM reasoning research and skill development - **Quality Assurance**: Five-stage curation pipeline with deduplication and heuristic filtering - **RL-Ready**: Domain-specific reward functions for reliable evaluation - **Difficulty Calibration**: Samples filtered to maintain appropriate challenge levels ### Data Structure The dataset is stored in Parquet format for efficient access and processing. Each sample contains at least the following fields: 1. **data_source** - Type: String - Description: Identifier indicating the origin dataset and domain for mapping specific reward functions 2. **prompt** - Type: List of message objects - Contains: - content: The actual text content - role: Message role (e.g., "user") 3. **ability** - Type: String - Description: The primary reasoning skill tested 4. **apply_chat_template** - Type: Boolean - Description: Flag for chat formatting 5. **qwen2.5_7b_pass_rate** - Type: Float - Description: Pass rate with Qwen 2.5-7B model 6. **qwen3_30b_pass_rate** - Type: Float - Description: Pass rate with Qwen 3-30B model 7. **extra_info** - Type: Dictionary - Description: Supplementary information for reward computing - Note: Detailed structures vary from tasks 8. **reward_model** - Type: Dictionary - Contains: - ground_truth: Compressed answer/verification data - Note: Detailed structures vary from tasks ### Domains and Statistics | Domain | Datasets Included | Final Sample Count | Key Focus Areas | |--------|------------------|-------------------|-----------------| | **Math** | OR1, DAPO, DeepScaler | 54.4K | Competition problems, symbolic reasoning | | **Code** | LeetCode, TACO-Verified, PrimeIntellect, LiveCodeBench | 18.1K | Programming challenges, algorithm design | | **Science** | WebInstruct-Verified | 3.6K | University/PhD-level physics, chemistry, biology | | **Logic** | ARC-AGI, BARC, Custom puzzles | 6.3K | Symbolic reasoning, constraint satisfaction | | **Simulation** | Code I/O (PyEdu) | 3.7K | Code behavior prediction without execution | | **Table** | HiTab, MultiHierTT | 6.1K | Single and multi-table reasoning | **Total Samples**: 91.9K (filtered from 684.3K raw samples) ### Dataset Sources | Domain | Dataset | Source | |--------|---------|--------| | **Math** | OR1 | [Skywork-OR1 (2025)](https://github.com/SkyworkAI/Skywork-O1-Open) | | | DAPO | [DAPO Dataset](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) | | | DeepScaler | [DeepScaleR Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) | | **Code** | LeetCode | [LeetCode Dataset](https://huggingface.co/datasets/greengerong/leetcode) | | | TACO-Verified | [TACO Dataset](https://huggingface.co/datasets/BAAI/TACO) | | | PrimeIntellect | [PrimeIntellect Dataset](https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1) | | | LiveCodeBench (history) | [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) | | **Science** | WebInstruct-Verified | [WebInstruct Dataset](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified) | | **Logic** | Zebra Puzzle | - | | | Ordering Puzzle | - | | | Graph Puzzle | - | | | ARC-AGI-1/2 | [ARC-AGI Dataset](https://arcprize.org/arc-agi) | | | BARC | [BARC Dataset](https://huggingface.co/barc0) | | **Simulation** | Code I/O (PyEdu) | [CodeIO-PyEdu Dataset](https://huggingface.co/datasets/hkust-nlp/CodeIO-PyEdu-Reasoning) | | **Table** | HiTab | [HiTab Dataset](https://github.com/microsoft/HiTab) | | | MultiHierTT | [MultiHierTT Dataset](https://github.com/psunlpgroup/MultiHiertt) | ## Citation If you find this dataset helpful in your research, please consider citing: ```bibtex @misc{cheng2025revisiting, title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective}, author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu}, journal = {arXiv preprint arXiv:2506.14965}, year = {2025}, doi = {10.48550/arXiv.2506.14965}, url = {https://arxiv.org/abs/2506.14965} } ``` *This dataset card follows the Hugging Face dataset card template and provides comprehensive information about the Guru dataset structure, creation process, and intended use cases.*

# 从跨域视角重新审视用于大语言模型(Large Language Model, LLM)推理的强化学习 ## 数据集描述 **Guru**是一个面向强化学习(Reinforcement Learning, RL)训练大语言模型完成复杂推理任务的精选六域数据集。该数据集包含91.9K高质量样本,覆盖六个推理密集型领域,通过一套完整的五阶段精选流程进行处理,以确保领域多样性与奖励可验证性。 ### 数据集概览 Guru针对大语言模型在跨域推理能力上的迫切需求,提供了在**数学、代码、科学、逻辑、仿真、表格推理**六大领域间精心平衡的问题集合。每一个样本都经过了质量过滤,并配备了自动化验证机制,非常适合强化学习应用场景。 ### 核心特性 - **跨域覆盖**:涵盖六大推理领域,支持LLM推理研究与能力培养 - **质量保障**:采用五阶段精选流程,包含去重与启发式过滤环节 - **适配强化学习**:提供领域专属奖励函数以实现可靠评估 - **难度校准**:对样本进行筛选以维持合适的挑战等级 ### 数据结构 该数据集以Parquet格式存储,以实现高效的访问与处理。每个样本至少包含以下字段: 1. **data_source** - 类型:字符串 - 描述:用于标识原始数据集与领域的标识符,以便匹配特定的奖励函数 2. **prompt** - 类型:消息对象列表 - 包含内容: - content:实际文本内容 - role:消息角色(例如"user") 3. **ability** - 类型:字符串 - 描述:测试的核心推理技能 4. **apply_chat_template** - 类型:布尔值 - 描述:用于指示是否需要应用对话模板的标记 5. **qwen2.5_7b_pass_rate** - 类型:浮点数 - 描述:使用Qwen 2.5-7B模型时的通过率 6. **qwen3_30b_pass_rate** - 类型:浮点数 - 描述:使用Qwen 3-30B模型时的通过率 7. **extra_info** - 类型:字典 - 描述:用于奖励计算的补充信息 - 备注:具体结构因任务而异 8. **reward_model** - 类型:字典 - 包含内容: - ground_truth:压缩后的答案/验证数据 - 备注:具体结构因任务而异 ### 领域与统计数据 | 领域 | 包含的数据集 | 最终样本量 | 核心聚焦方向 | |------------|----------------------------|------------|----------------------------------| | **数学** | OR1、DAPO、DeepScaler | 54.4K | 竞赛题、符号推理 | | **代码** | LeetCode、TACO-Verified、PrimeIntellect、LiveCodeBench | 18.1K | 编程挑战、算法设计 | | **科学** | WebInstruct-Verified | 3.6K | 本科/博士阶段物理、化学、生物 | | **逻辑** | ARC-AGI、BARC、自定义谜题 | 6.3K | 符号推理、约束满足问题 | | **仿真** | Code I/O (PyEdu) | 3.7K | 无需执行的代码行为预测 | | **表格** | HiTab、MultiHierTT | 6.1K | 单表与多表推理 | **总样本量**:91.9K(从684.3K原始样本中筛选得到) ### 数据集来源 | 领域 | 数据集名称 | 来源链接 | |------------|--------------------------|--------------------------------------------------------------------------| | **数学** | OR1 | [Skywork-OR1 (2025)](https://github.com/SkyworkAI/Skywork-O1-Open) | | | DAPO | [DAPO数据集](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) | | | DeepScaler | [DeepScaleR数据集](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) | | **代码** | LeetCode | [LeetCode数据集](https://huggingface.co/datasets/greengerong/leetcode) | | | TACO-Verified | [TACO数据集](https://huggingface.co/datasets/BAAI/TACO) | | | PrimeIntellect | [PrimeIntellect数据集](https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-1) | | | LiveCodeBench (历史版本) | [LiveCodeBench](https://github.com/LiveCodeBench/LiveCodeBench) | | **科学** | WebInstruct-Verified | [WebInstruct数据集](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified) | | **逻辑** | 斑马谜题(Zebra Puzzle) | - | | | 排序谜题(Ordering Puzzle) | - | | | 图谜题(Graph Puzzle) | - | | | ARC-AGI-1/2 | [ARC-AGI数据集](https://arcprize.org/arc-agi) | | | BARC | [BARC数据集](https://huggingface.co/barc0) | | **仿真** | Code I/O (PyEdu) | [CodeIO-PyEdu数据集](https://huggingface.co/datasets/hkust-nlp/CodeIO-PyEdu-Reasoning) | | **表格** | HiTab | [HiTab数据集](https://github.com/microsoft/HiTab) | | | MultiHierTT | [MultiHierTT数据集](https://github.com/psunlpgroup/MultiHiertt) | ### 引用说明 如果您在研究中使用了该数据集,请引用以下文献: bibtex @misc{cheng2025revisiting, title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective}, author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu}, journal = {arXiv preprint arXiv:2506.14965}, year = {2025}, doi = {10.48550/arXiv.2506.14965}, url = {https://arxiv.org/abs/2506.14965} } *本数据集卡片采用Hugging Face数据集卡片模板规范编写,全面涵盖了Guru数据集的结构、构建流程与适用场景等信息。*
提供机构:
maas
创建时间:
2025-06-03
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作