GenPRM-MATH-Data
Hosted on ModelScope. Updated 2026-01-02; indexed 2025-04-12.
Download link:
https://modelscope.cn/datasets/GenPRM/GenPRM-MATH-Data
# Dataset Card
This dataset contains **23K conversations**, serving as the official training data for the GenPRM model. The data synthesis and filtering procedures follow the methodology described in our paper:
- Project Page: [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://ryanliu112.github.io/GenPRM/)
- Paper: [https://arxiv.org/abs/2504.00891](https://arxiv.org/abs/2504.00891)
- Code: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM)
# Models
**SFT models are available in the following repositories:**
- [GenPRM-7B](https://huggingface.co/GenPRM/GenPRM-7B)
- [GenPRM-1.5B](https://huggingface.co/GenPRM/GenPRM-1.5B)
# Data Format and Structure
The data is organized as conversations, in the format expected by the [axolotl](https://github.com/axolotl-ai-cloud/axolotl) training framework. Each data instance adheres to the following schema:
- system: the system prompt is fixed as "You are a math teacher. Your task is to review and critique the paragraphs in solution step by step."
- user: the user content consists of the **math problem** (at the beginning of the first paragraph) and the **solution** (split into paragraphs)
- assistant: the assistant content is a **step-level verification** of the user's solution paragraph. Each assistant message consists of three parts:
  - Chain-of-Thought reasoning (within `<analyze></analyze>`)
  - Code-based verification (within `<verify></verify>`)
  - Correctness judgement (within `<output></output>`)
GenPRM is trained only on the assistant turns. At inference time, we extract the probability of the `Yes` token in the `<output>` stage as the step reward.
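As a rough illustration of the step-reward idea, the sketch below turns raw logits at the judgement position into a probability for the `Yes` token using a plain softmax. The function name `yes_probability`, the three-token toy vocabulary, and the logit values are all hypothetical; the real model scores its full vocabulary, and the project's actual extraction code may differ.

```python
import math
from typing import Dict

def yes_probability(logits: Dict[str, float], yes_token: str = "Yes") -> float:
    """Softmax probability of `yes_token`, given raw logits keyed by token string."""
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - max_logit) for tok, val in logits.items()}
    return exps[yes_token] / sum(exps.values())

# Hypothetical logits at the position where the judgement token is generated.
toy_logits = {"Yes": 2.0, "No": 0.0, "Maybe": -1.0}
step_reward = yes_probability(toy_logits)
print(f"step reward: {step_reward:.3f}")
```

A higher value means the model is more confident that the solution paragraph is correct.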
```json
[
{
"role": "system",
"content": "You are a math teacher. Your task is to review and critique the paragraphs in solution step by step."
},
{
"role": "user",
"content": "Question: Let $f(x)=x^2-7x+18$ and let $g(f(x))=2x+3$. What is the sum of all possible values of $g(8)$?\n\nTo solve the problem, we need to first understand the given functions and how they interact with each other. We are given $f(x) = x^2 - 7x + 18$ and $g(f(x)) = 2x + 3$."
},
{
"role": "assistant",
"content": "<analyze>\nParagraph 1 is correct. It accurately states the given functions and the problem's goal. The problem provides f(x) = x² - 7x + 18 and g(f(x)) = 2x + 3. The user needs to find the sum of all possible values of g(8). The paragraph correctly sets up the functions without any errors.\n</analyze>\n<verify>\n```python\n# No direct computation needed here; the analysis is about understanding the problem setup.\n# Since there's no calculation required, we can just confirm that the functions are correctly presented.\nprint(\"Functions are correctly stated as per problem.\")\n```\n[Code Output]\n\n```\nFunctions are correctly stated as per problem.\n```\n</verify>\n<output>\n**Judgement**: $\\boxed{Yes}$\n</output>\n"
},
{
"role": "user",
"content": "The goal is to find the sum of all possible values of $g(8)$."
},
{
"role": "assistant",
"content": "<analyze>\nParagraph 2 is correct. The problem asks for the sum of all possible values of g(8). This paragraph correctly reiterates the goal without any inaccuracies.\n</analyze>\n<verify>\n```python\n# No computation required here as it's just restating the problem's objective.\nprint(\"Objective correctly restated.\")\n```\n[Code Output]\n\n```\nObjective correctly restated.\n```\n</verify>\n<output>\n**Judgement**: $\\boxed{Yes}$\n</output>\n"
},
...
]
```
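The three-part assistant format can be pulled apart with a few regular expressions. The following is a minimal sketch rather than the project's own tooling: the helper names `parse_assistant_turn` and `extract_judgement` are hypothetical, and the sample string is a shortened version of the second assistant turn above (its Markdown code fences and `[Code Output]` block are omitted for brevity).

```python
import re
from typing import Dict, Optional

SECTION_TAGS = ("analyze", "verify", "output")

def parse_assistant_turn(content: str) -> Dict[str, str]:
    """Return the text inside each of the three tagged sections ('' if absent)."""
    parts = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", content, flags=re.DOTALL)
        parts[tag] = match.group(1).strip() if match else ""
    return parts

def extract_judgement(output_section: str) -> Optional[str]:
    """Pull the label out of the boxed judgement, e.g. 'Yes'; None if not found."""
    match = re.search(r"\\boxed\{(\w+)\}", output_section)
    return match.group(1) if match else None

# Shortened sample modeled on the second assistant turn in the example above.
sample_turn = (
    "<analyze>\nParagraph 2 is correct.\n</analyze>\n"
    "<verify>\nprint(\"Objective correctly restated.\")\n</verify>\n"
    "<output>\n**Judgement**: $\\boxed{Yes}$\n</output>\n"
)

parts = parse_assistant_turn(sample_turn)
judgement = extract_judgement(parts["output"])
print(parts["analyze"])
print(judgement)
```

The non-greedy `(.*?)` with `re.DOTALL` keeps each match inside its own tag pair even though the sections span multiple lines.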
# Citation
If you find this work helpful, please kindly cite our paper:
```bibtex
@article{zhao2025genprm,
title = {GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning},
author = {Jian Zhao and Runze Liu and Kaiyan Zhang and Zhimu Zhou and Junqi Gao and Dong Li and Jiafei Lyu and Zhouyi Qian and Biqing Qi and Xiu Li and Bowen Zhou},
journal = {arXiv preprint arXiv:2504.00891},
year = {2025}
}
```
Our collection of PRMs in [Awesome-Process-Reward-Models](https://github.com/RyanLiu112/Awesome-Process-Reward-Models):
```bibtex
@misc{Awesome-Process-Reward-Models,
title = {Awesome Process Reward Models},
author = {Runze Liu and Jian Zhao and Kaiyan Zhang and Zhimu Zhou and Junqi Gao and Dong Li and Jiafei Lyu and Zhouyi Qian and Biqing Qi and Xiu Li and Bowen Zhou},
howpublished = {\url{https://github.com/RyanLiu112/Awesome-Process-Reward-Models}},
note = {GitHub repository},
year = {2025}
}
```
Our recent work on LLM test-time scaling with PRMs:
```bibtex
@article{liu2025can,
title = {Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling},
author = {Runze Liu and Junqi Gao and Jian Zhao and Kaiyan Zhang and Xiu Li and Biqing Qi and Wanli Ouyang and Bowen Zhou},
journal = {arXiv preprint arXiv:2502.06703},
year = {2025}
}
```
Provider: maas
Created: 2025-04-08



