RLPR-Evaluation
ModelScope community · Updated 2025-12-05 · Indexed 2025-06-28
Download link:
https://modelscope.cn/datasets/OpenBMB/RLPR-Evaluation
Description:
# Dataset Card for RLPR-Evaluation
[GitHub](https://github.com/openbmb/RLPR) | [Paper](https://huggingface.co/papers/2506.18254)
## News:
* **[2025.06.23]** 📃 Our paper describing the RLPR framework and its comprehensive evaluation with this suite is available [here](https://arxiv.org/abs/2506.18254)!
## Dataset Summary
We include the following seven benchmarks for evaluation of RLPR:
**Mathematical Reasoning Benchmarks:**
* **MATH-500 ([Cobbe et al., 2021](https://arxiv.org/abs/2110.14168))**
* **Minerva ([Lewkowycz et al., 2022](https://arxiv.org/abs/2206.14858))**
* **AIME24**
**General Domain Reasoning Benchmarks:**
* **MMLU-Pro ([Wang et al., 2024](https://arxiv.org/abs/2406.01574)):** A multitask language understanding benchmark with reasoning-intensive questions. We randomly sample 1000 prompts to balance evaluation efficiency against variance.
* **GPQA ([Rein et al., 2023](https://arxiv.org/abs/2311.12022)):** Graduate-level questions across disciplines. We use the highest-quality **GPQA-diamond** subset.
* **TheoremQA ([Chen et al., 2023](https://arxiv.org/abs/2305.12524)):** Assesses the ability to apply theorems to solve complex science problems (Math, Physics, etc.). We use the 800 high-quality questions that remain after removing 53 multimodal instructions.
* **WebInstruct (Validation Split) ([Ma et al., 2025](https://arxiv.org/abs/2505.14652)):** A held-out validation split from WebInstruct, designed as an accessible benchmark for medium-sized models. We uniformly sample 1k prompts and apply 10-gram deduplication, resulting in **638 distinct questions**.
This multi-faceted suite allows for a thorough evaluation of reasoning capabilities across diverse domains and difficulty levels.
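The 10-gram deduplication mentioned for the WebInstruct split can be illustrated with a minimal sketch. The card does not specify the tokenizer or the overlap criterion, so the whitespace tokenization, the lowercasing, and the "drop a prompt if it shares any 10-gram with an earlier kept prompt" rule below are all assumptions, and the function names are hypothetical.

```python
def ngrams(text, n=10):
    """Return the set of word-level n-grams in `text`.

    Assumption: lowercase + whitespace tokenization; the card does not
    say how prompts were tokenized.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dedup_by_ngram(prompts, n=10):
    """Keep a prompt only if it shares no n-gram with an earlier kept prompt.

    Prompts shorter than `n` tokens have no n-grams and are always kept.
    """
    seen = set()   # all n-grams from prompts kept so far
    kept = []
    for prompt in prompts:
        grams = ngrams(prompt, n)
        if grams & seen:
            continue  # overlaps with an earlier prompt: treat as duplicate
        seen |= grams
        kept.append(prompt)
    return kept
```

Under this rule the result depends on the order of the input prompts, since the first occurrence of an overlapping pair is the one that survives.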
## Usage
```python
from datasets import load_dataset
data = load_dataset("openbmb/RLPR-Evaluation")
```
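Because all seven benchmarks ship in one dataset, a common next step is to split the samples by their `data_source` field. The helpers below are a sketch that works on any iterable of dict-like rows; the function names are hypothetical, and the concrete `data_source` strings are not listed in this card, so inspect them before filtering.

```python
from collections import Counter


def count_by_source(records):
    """Count samples per benchmark using the `data_source` field."""
    return Counter(record["data_source"] for record in records)


def select_source(records, source):
    """Return the samples whose `data_source` equals `source`."""
    return [record for record in records if record["data_source"] == source]
```

With the loaded dataset, these can be applied to one split at a time, for example `count_by_source(data[split_name])`, where `split_name` is one of the split names shown by `print(data)`.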
## Data Fields
The dataset contains the following fields for each sample:
| | Key | Description |
| --- | -------------- | ----------------------------------------------------------------------------------------------- |
| 0 | `data_source` | Identifier for the specific benchmark or split. |
| 1 | `prompt` | The input question or problem statement, potentially with context or instructions. |
| 2 | `ability` | The domain or category of the task. |
| 3 | `reward_model` | Dictionary containing the `ground_truth` answer, essential for scoring. |
| 4 | `extra_info` | Benchmark-specific metadata, such as `answer_type`, `category`, `difficulty`, `id`, or `split`. |
| 5 | `uid` | Unique identifier for the item in the dataset. |
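As an illustration of how the `reward_model` field might be used, the sketch below scores predictions against `ground_truth` with a naive normalized exact match. This is not the scoring procedure from the RLPR paper; the normalization rule and the function names are assumptions made for the example, and free-form or mathematical answers generally need a more tolerant comparison.

```python
def normalize(answer):
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return " ".join(str(answer).strip().lower().split())


def exact_match(sample, prediction):
    """Compare a prediction with the sample's `ground_truth` answer."""
    truth = sample["reward_model"]["ground_truth"]
    return normalize(prediction) == normalize(truth)


def accuracy(samples, predictions):
    """Fraction of samples whose prediction exactly matches the ground truth."""
    if not samples:
        return 0.0
    hits = sum(exact_match(s, p) for s, p in zip(samples, predictions))
    return hits / len(samples)
```

Here `samples` is a list of rows from the dataset and `predictions` is a list of model outputs in the same order.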
## Citation
If you use the RLPR framework or refer to our evaluation methodology using this suite, please cite our paper. Additionally, please cite the original papers for any component benchmarks you use:
```bibtex
@misc{yu2025rlprextrapolatingrlvrgeneral,
title={RLPR: Extrapolating RLVR to General Domains without Verifiers},
author={Tianyu Yu and Bo Ji and Shouli Wang and Shu Yao and Zefan Wang and Ganqu Cui and Lifan Yuan and Ning Ding and Yuan Yao and Zhiyuan Liu and Maosong Sun and Tat-Seng Chua},
year={2025},
eprint={2506.18254},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://huggingface.co/papers/2506.18254},
}
```
Provider:
maas
Created:
2025-06-23



