RLPR-Train-Dataset
收藏魔搭社区2026-01-06 更新2025-06-28 收录
下载链接:
https://modelscope.cn/datasets/OpenBMB/RLPR-Train-Dataset
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for RLPR-Train-Dataset
[GitHub](https://github.com/openbmb/RLPR) | [Paper](https://arxiv.org/abs/2506.18254)
## News:
* **[2025.06.23]** 📃 Our paper detailing the RLPR framework and this dataset is accessible at [here](https://arxiv.org/abs/2506.18254).
## Dataset Summary
The **RLPR-Train-Dataset** is a curated collection of **77k high-quality reasoning prompts** specifically designed for enhancing Large Language Model (LLM) capabilities in the **general domain (non-mathematical)**.
This dataset is derived from the comprehensive collection of prompts from [WebInstruct](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub). We focused on its utility for general-domain reasoning by:
1. Selecting **only non-mathematics prompts**.
2. Employing **GPT-4.1 ([OpenAI, 2025](https://openai.com/index/gpt-4-1/)) to filter out prompts that were too easy**, ensuring a challenging and effective training set.
Training models with the RLPR framework, which utilizes this dataset, enables them to **substantially enhance reasoning capabilities without requiring external verifiers**. This dataset is instrumental in developing models that can effectively tackle complex reasoning across diverse non-mathematical topics.

Models trained using the RLPR framework, which leverages this dataset, demonstrate significant improvements on various benchmarks. For instance, RLPR with Qwen2.5-7B achieved **56.0 on MMLU-Pro** and **55.4 on TheoremQA**.

The focus on diverse, non-mathematical, and appropriately challenging prompts within this dataset contributes to the **robustness and generalizability** of the RLPR framework in improving reasoning for general-domain tasks.
## Usage
```python
from datasets import load_dataset
from pprint import pprint
# --- User Configuration ---
DATASET_ID = "openbmb/RLPR-Train-Dataset" # Dataset ID on Hugging Face
INDEX_TO_SHOW = 0 # Index of the item to display
# --- End User Configuration ---
SPLIT_NAME_TO_USE = "train"
def show_item_details(dataset_id: str, index: int, split_name: str = None):
"""
Loads a dataset and displays the item at the given index,
showing each field and its value.
"""
print(f"Loading dataset: '{dataset_id}'...")
dataset_dict = load_dataset(dataset_id)
available_splits = list(dataset_dict.keys())
selected_split_key = split_name
data_subset = dataset_dict[selected_split_key]
if not (0 <= index < len(data_subset)):
print(f"Error: Index {index} is out of bounds for split '{selected_split_key}' (size: {len(data_subset)}).")
print(f"Please provide an index between 0 and {len(data_subset) - 1}.")
return
item = data_subset[index]
print(f"\n--- Item at index {index} from split '{selected_split_key}' of dataset '{dataset_id}' ---")
for field, value in item.items():
print(f"\nField: '{field}'")
print("Value:")
pprint(value, indent=2, width=100, sort_dicts=False) # Use pprint for readability
print("--- End of item ---")
if __name__ == "__main__":
show_item_details(
dataset_id=DATASET_ID,
index=INDEX_TO_SHOW,
split_name=SPLIT_NAME_TO_USE
)
```
You will get the following output:
```
--- Item at index 0 from split 'train' of dataset 'openbmb/RLPR-Train-Dataset' ---
Field: 'data_source'
Value:
'WebInstruct-verified'
Field: 'prompt'
Value:
[ { 'content': 'A conversation between User and Assistant. The user asks a question, and the '
'Assistant solves it. The assistant first thinks about the reasoning process in the '
'mind and then provides the user with the answer. The reasoning process and answer '
'are enclosed within <think> </think> and <answer> </answer> tags, respectively, '
'i.e., <think> reasoning process here </think> <answer> answer here </answer>.',
'role': 'system'},
{ 'content': 'If the firm could reduce the average age of its inventory from 73 days, to 63 day, '
'by how much would it reduce its dollar investment in working capital?',
'role': 'user'}]
Field: 'ability'
Value:
'Business'
Field: 'reward_model'
Value:
{'ground_truth': '2.74%', 'style': 'rule'}
Field: 'extra_info'
Value:
{ 'answer_type': 'Percentage',
'category': 'Business',
'completion_tokens': 153,
'difficulty': 'Senior High School',
'id': '1904374',
'prompt_tokens': 799,
'reasoning_score': 3,
'reasoning_score_response': 'To evaluate the reasoning level requirement of the given question, '
'we need to consider the nature of the problem. The question asks '
'about the impact of reducing the average age of inventory on the '
'dollar investment in working capital. \n'
'\n'
'This requires an understanding of inventory management and working '
'capital calculations. Specifically, it involves knowledge of how '
'inventory turnover affects working capital needs. The respondent '
'must apply this knowledge to calculate the reduction in dollar '
'investment based on the change in inventory age. \n'
'\n'
'While the problem does not require deep analysis or a comprehensive '
'strategy, it does necessitate moderate reasoning skills and '
'knowledge of financial concepts. Therefore, it is not a simple '
'recall of facts, nor is it overly complex. \n'
'\n'
'Given these considerations, I would assign the following score:\n'
'\n'
'Reasoning score: 3',
'total_tokens': 952}
Field: '__index_level_0__'
Value:
0
--- End of item ---
```
## Data Fields
The dataset contains the following fields for each sample:
| | Key | Description |
| --- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 0 | `data_source` | The original source or collection from which the prompt was derived. |
| 1 | `prompt` | A list of dictionaries representing the conversational prompt provided to the LLM. Each dictionary contains a 'role' and 'content'. The system message defines the expected reasoning format. |
| 2 | `ability` | The category or domain of the reasoning task. This reflects the general domain focus of the RLPR dataset. |
| 3 | `reward_model` | A dictionary containing information related to the reference answer used for reward calculation in the RLPR framework. This includes: <br> - `ground_truth`: The reference answer string. <br> - `style`: Potentially metadata about the ground truth. |
| 4 | `extra_info` | A dictionary containing various metadata about the prompt and its associated information. This includes: <br> - `answer_type`: The expected format/type of the answer. <br> - `category`: A more specific category. <br> - `difficulty`: An assessment of the prompt's difficulty level. <br> - `id`: A unique identifier for the prompt. <br> - `reasoning_score_response`: A textual explanation or rationale for an assigned reasoning score. <br> - `total_tokens`: Token counts. |
| 5 | `_index_level_0_` | An internal index for the data sample |
## Acknowledgement
This dataset is sourced from [WebInstruct](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub).
## Citation
If you find our model/code/paper helpful, please consider citing our papers 📝:
```bibtex
@misc{yu2025rlprextrapolatingrlvrgeneral,
title={RLPR: Extrapolating RLVR to General Domains without Verifiers},
author={Tianyu Yu and Bo Ji and Shouli Wang and Shu Yao and Zefan Wang and Ganqu Cui and Lifan Yuan and Ning Ding and Yuan Yao and Zhiyuan Liu and Maosong Sun and Tat-Seng Chua},
year={2025},
eprint={2506.18254},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.18254},
}
```
# RLPR-Train-Dataset 数据集卡片
[GitHub](https://github.com/openbmb/RLPR) | [论文](https://arxiv.org/abs/2506.18254)
## 动态:
* **[2025.06.23]** 📃 我们关于RLPR框架与本数据集的研究论文已在[此处](https://arxiv.org/abs/2506.18254)上线。
## 数据集概览
**RLPR-Train-Dataset** 是一个精心整理的77,000条高质量推理提示词集合,专为提升大语言模型(Large Language Model, LLM)在**通用领域(非数学方向)**的推理能力而打造。
本数据集源自[WebInstruct](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub)的全量提示词集合。我们聚焦其在通用领域推理场景的实用性,通过以下两步完成筛选:
1. **仅保留非数学类提示词**。
2. 采用**GPT-4.1(OpenAI, 2025, https://openai.com/index/gpt-4-1/)**过滤掉过于简单的提示词,以确保训练集兼具挑战性与训练有效性。
使用本数据集配合RLPR框架训练模型,可使其**在无需外部验证器的前提下,显著提升推理能力**。本数据集可助力开发能够有效解决各类非数学复杂推理任务的模型。

使用本数据集与RLPR框架训练的模型,在多项基准测试中展现出显著性能提升。例如,搭配Qwen2.5-7B的RLPR在MMLU-Pro上取得**56.0分**,在TheoremQA上取得**55.4分**。

本数据集聚焦多样化、非数学且难度适中的提示词,有助于提升RLPR框架在通用领域推理任务中的**鲁棒性与泛化能力**。
## 使用方法
python
from datasets import load_dataset
from pprint import pprint
# --- 用户配置项 ---
DATASET_ID = "openbmb/RLPR-Train-Dataset" # Hugging Face 平台上的数据集ID
INDEX_TO_SHOW = 0 # 待展示的数据项索引
# --- 用户配置项结束 ---
SPLIT_NAME_TO_USE = "train"
def show_item_details(dataset_id: str, index: int, split_name: str = None):
"""
加载数据集并展示指定索引处的数据项,
显示每个字段及其对应的值。
"""
print(f"Loading dataset: '{dataset_id}'...")
dataset_dict = load_dataset(dataset_id)
available_splits = list(dataset_dict.keys())
selected_split_key = split_name
data_subset = dataset_dict[selected_split_key]
if not (0 <= index < len(data_subset)):
print(f"错误:索引 {index} 超出了分割集 '{selected_split_key}' 的范围(数据集大小:{len(data_subset)})。")
print(f"请提供介于 0 至 {len(data_subset) - 1} 之间的索引值。")
return
item = data_subset[index]
print(f"
--- 数据集 '{dataset_id}' 的 '{selected_split_key}' 分割集中索引为 {index} 的数据项 ---")
for field, value in item.items():
print(f"
字段:'{field}'")
print("值:")
pprint(value, indent=2, width=100, sort_dicts=False) # 使用 pprint 提升可读性
print("--- 数据项展示结束 ---")
if __name__ == "__main__":
show_item_details(
dataset_id=DATASET_ID,
index=INDEX_TO_SHOW,
split_name=SPLIT_NAME_TO_USE
)
运行上述代码将得到如下输出:
--- 数据集 'openbmb/RLPR-Train-Dataset' 的 'train' 分割集中索引为 0 的数据项 ---
字段:'data_source'
值:
'WebInstruct-verified'
字段:'prompt'
值:
[ { 'content': '用户与助手的对话示例。用户提出问题,助手进行解答。助手需先在脑海中完成推理过程,再向用户给出答案。推理过程与答案需分别用 <think> </think> 与 <answer> </answer> 标签包裹,即 <think> 推理过程内容 </think> <answer> 答案内容 </answer>。',
'role': 'system'},
{ 'content': '若某企业可将其库存平均周转天数从73天降至63天,那么其营运资本的美元投入将减少多少?',
'role': 'user'}]
字段:'ability'
值:
'Business'
字段:'reward_model'
值:
{'ground_truth': '2.74%', 'style': 'rule'}
字段:'extra_info'
值:
{ 'answer_type': 'Percentage',
'category': 'Business',
'completion_tokens': 153,
'difficulty': 'Senior High School',
'id': '1904374',
'prompt_tokens': 799,
'reasoning_score': 3,
'reasoning_score_response': '为评估给定问题的推理难度要求,我们需考虑问题的本质。该问题询问将库存平均周转天数减少对营运资本美元投入的影响。
这需要理解库存管理与营运资本计算的相关知识。具体而言,需掌握库存周转率如何影响营运资本需求。受访者需运用该知识,根据库存周转天数的变化计算营运资本投入的减少量。
尽管该问题无需深度分析或全面策略,但确实需要中等水平的推理能力与金融概念知识。因此,这并非简单的事实回忆,也不过于复杂。
综合以上考虑,我将给出如下评分:
推理评分:3',
'total_tokens': 952}
字段:'__index_level_0__'
值:
0
--- 数据项展示结束 ---
## 数据字段
本数据集的每个样本包含以下字段:
| 序号 | 键名 | 描述 |
| ---- | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 0 | `data_source` | 提示词的原始来源或所属数据集集合。 |
| 1 | `prompt` | 一个字典列表,代表提供给大语言模型的会话式提示词。每个字典包含`role`与`content`字段。其中系统消息用于定义预期的推理格式。 |
| 2 | `ability` | 推理任务的类别或领域,体现了RLPR数据集的通用领域聚焦方向。 |
| 3 | `reward_model` | 一个字典,包含RLPR框架中用于奖励计算的参考答案相关信息,包含:<br>- `ground_truth`:参考答案字符串。<br>- `style`:参考答案相关的元数据。 |
| 4 | `extra_info` | 一个字典,包含提示词及其关联信息的各类元数据,包含:<br>- `answer_type`:答案的预期格式/类型。<br>- `category`:更细分的任务类别。<br>- `difficulty`:提示词难度评估。<br>- `id`:提示词的唯一标识符。<br>- `reasoning_score_response`:为所分配的推理评分提供的文字解释或依据。<br>- `total_tokens`:Token计数。 |
| 5 | `_index_level_0_` | 数据样本的内部索引字段。 |
## 致谢
本数据集源自[WebInstruct](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub)。
## 引用
若您认为我们的模型/代码/论文有所帮助,请考虑引用我们的论文 📝:
bibtex
@misc{yu2025rlprextrapolatingrlvrgeneral,
title={RLPR: Extrapolating RLVR to General Domains without Verifiers},
author={Tianyu Yu and Bo Ji and Shouli Wang and Shu Yao and Zefan Wang and Ganqu Cui and Lifan Yuan and Ning Ding and Yuan Yao and Zhiyuan Liu and Maosong Sun and Tat-Seng Chua},
year={2025},
eprint={2506.18254},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.18254},
}
提供机构:
maas
创建时间:
2025-06-23



