AceMath-RewardBench
Source: ModelScope community · Updated 2025-11-12 · Indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/nv-community/AceMath-RewardBench
[website](https://research.nvidia.com/labs/adlr/acemath/) | [paper](https://arxiv.org/abs/2412.15084)
## AceMath-RewardBench Evaluation Dataset Card
The AceMath-RewardBench evaluation dataset measures the capabilities of a math reward model in the best-of-N (N=8) setting on 7 datasets:
- **GSM8K**: 1319 questions
- **MATH500**: 500 questions
- **Minerva Math**: 272 questions
- **Gaokao 2023 en**: 385 questions
- **OlympiadBench**: 675 questions
- **College Math**: 2818 questions
- **MMLU STEM**: 3018 questions
Each example in the dataset contains:
- A mathematical question
- 64 solution attempts with varying quality (8 each from Qwen2/2.5-Math-7/72B-Instruct, Llama3.1-8/70B-Instruct, Mathstral-7B-v0.1, deepseek-math-7b-instruct)
- Ground truth scores for each solution
- Additional metadata like problem difficulty and topic area
The evaluation benchmark focuses on two criteria:
- Diversity: each question is paired with 64 model responses generated from 8 different language models
- Robustness: each reported result is an average over 100 random seeds, where every seed randomly samples 8 responses from the 64 candidates
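The sampling protocol described above can be sketched roughly as follows. This is a minimal illustration, not the authors' `evaluate_orm.py`: the function name `rm_at_n`, the `"reward"` field (one reward-model score per response, which the dataset itself does not ship), and the use of seeds `0..99` are all assumptions made for the example.

```python
import random

def rm_at_n(examples, n=8, num_seeds=100):
    """Average best-of-n accuracy (in percent) over `num_seeds` random draws.

    Each example is a dict with two parallel lists:
      "score":  booleans, True if that response matches the ground truth
      "reward": floats from a reward model (hypothetical field, not in the dataset)
    """
    per_seed_accuracy = []
    for seed in range(num_seeds):
        rng = random.Random(seed)
        num_correct = 0
        for example in examples:
            # Draw n distinct candidate indices out of all available responses.
            picked = rng.sample(range(len(example["score"])), n)
            # Keep the candidate that the reward model ranks highest.
            best = max(picked, key=lambda i: example["reward"][i])
            num_correct += bool(example["score"][best])
        per_seed_accuracy.append(100.0 * num_correct / len(examples))
    return sum(per_seed_accuracy) / len(per_seed_accuracy)

# Toy data with 64 candidates per question (made up for illustration).
# Question 1: only 7 wrong responses, so every draw of 8 contains a correct one.
# Question 2: no correct response at all.
toy = [
    {"score": [i >= 7 for i in range(64)],
     "reward": [1.0 if i >= 7 else 0.0 for i in range(64)]},
    {"score": [False] * 64, "reward": [0.5] * 64},
]
print(rm_at_n(toy))  # 50.0
```

Averaging over many seeds reduces the variance that comes from which 8 of the 64 candidates happen to be drawn.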
## Benchmark Results
| Model | GSM8K | MATH500 | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|---------------------------|-------|---------|--------------|----------------|-----------------|--------------|-----------|--------|
| majority@8 | 96.22 | 83.11 | 41.20 | 68.21 | 42.69 | 45.01 | 78.21 | 64.95 |
| Skywork-o1-Open-PRM-Qwen-2.5-7B | 96.92 | 86.64 | 41.00 | 72.34 | 46.50 | 46.30 | 74.01 | 66.24 |
| Qwen2.5-Math-RM-72B | 96.61 | 86.63 | 43.60 | 73.62 | 47.21 | 47.29 | 84.24 | 68.46 |
| AceMath-7B-RM (Ours) | 96.66 | 85.47 | 41.96 | 73.82 | 46.81 | 46.37 | 80.78 | 67.41 |
| AceMath-72B-RM (Ours) | 97.23 | 86.72 | 45.06 | 74.69 | 49.23 | 46.79 | 87.01 | 69.53 |
*Reward model evaluation on [AceMath-RewardBench](https://huggingface.co/datasets/nvidia/AceMath-RewardBench). Average results (rm@8) of reward models on math benchmarks, obtained by randomly sampling 8 responses from 64 candidates with 100 random seeds. Response candidates are generated from a pool of 8 LLMs.*
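For comparison, the `majority@8` baseline in the table needs no reward model. One plausible way to compute it from the `pred` and `score` fields is sketched below. The function name `majority_at_n` is hypothetical, and the sketch assumes that predictions are plain strings, that identical strings share the same correctness label, and that tie-breaking is left to `Counter.most_common`. The authors' exact voting rule is not specified in this card.

```python
import random
from collections import Counter

def majority_at_n(examples, n=8, num_seeds=100):
    """Average majority-vote accuracy (in percent) over `num_seeds` random draws.

    Each example needs parallel "pred" (answer strings) and "score" (booleans) lists.
    """
    per_seed_accuracy = []
    for seed in range(num_seeds):
        rng = random.Random(seed)
        num_correct = 0
        for example in examples:
            picked = rng.sample(range(len(example["pred"])), n)
            # Count how often each extracted answer appears among the n draws.
            votes = Counter(example["pred"][i] for i in picked)
            winner = votes.most_common(1)[0][0]
            # Assumption: identical answer strings share the same correctness label,
            # so any drawn response carrying the winning answer can supply the score.
            voter = next(i for i in picked if example["pred"][i] == winner)
            num_correct += bool(example["score"][voter])
        per_seed_accuracy.append(100.0 * num_correct / len(examples))
    return sum(per_seed_accuracy) / len(per_seed_accuracy)

# Toy data (made up): with only 3 minority answers among 64, the other
# answer always wins a vote over 8 draws.
toy_votes = [
    {"pred": ["5"] * 3 + ["4"] * 61, "score": [False] * 3 + [True] * 61},
    {"pred": ["9"] * 3 + ["7"] * 61, "score": [True] * 3 + [False] * 61},
]
print(majority_at_n(toy_votes))  # 50.0
```

In practice, mathematically equivalent answers may be written differently, so a real implementation would likely normalize predictions before counting votes.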
## How to use
```python
from datasets import load_dataset
# Load the dataset from Hugging Face Hub
dataset = load_dataset("nvidia/AceMath-RewardBench")
print(dataset.keys())
# dict_keys(['gsm8k', 'math500', 'minerva_math', 'gaokao2023en', 'olympiadbench', 'college_math', 'mmlu_stem'])
# Print the first example
print(dataset['gsm8k'][0].keys())
# dict_keys(['pred', 'question', 'score', 'report', 'idx', 'code', 'gt_cot', 'gt'])
# "question": The text of the mathematical problem
# "code": A list of complete model responses/solutions
# "gt": The ground truth answer
# "pred": A list of extracted predictions from each model response in "code"
# "score": A list of boolean values indicating whether each response matches the ground truth
```
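As a small follow-up, the `score` field alone is enough to see how hard a question is for the response pool. The helper below is a sketch with a hypothetical name (`summarize_example`), and the record it runs on is hand-made to mimic the documented fields rather than taken from the dataset.

```python
def summarize_example(example):
    """Return basic statistics for one record with a "score" list of booleans."""
    scores = [bool(s) for s in example["score"]]
    return {
        "num_responses": len(scores),
        "num_correct": sum(scores),
        "fraction_correct": sum(scores) / len(scores),
        # If no candidate is correct, no reward model can pick a right answer.
        "solvable": any(scores),
    }

# Hand-made record mimicking the documented fields (not real dataset content).
record = {
    "question": "What is 2 + 2?",
    "gt": "4",
    "pred": ["4", "5", "4", "4"],
    "score": [True, False, True, True],
}
stats = summarize_example(record)
print(stats)
# With the real data loaded as above: summarize_example(dataset["gsm8k"][0])
```

The `solvable` flag gives an upper bound on what best-of-N selection can reach for that question.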
## How to run evaluation
- requirement: vllm==0.6.6.post1 (for reward model batch inference)
- We provide the inference code (`inference_benchmark.py`) and evaluation script (`evaluate_orm.py`) in `scripts/`:
```bash
bash scripts/example_eval.sh
```
- Full prediction results are in `scripts/orm_eval`
## All Resources
### AceMath Instruction Models
- [AceMath-1.5B-Instruct](https://huggingface.co/nvidia/AceMath-1.5B-Instruct), [AceMath-7B-Instruct](https://huggingface.co/nvidia/AceMath-7B-Instruct), [AceMath-72B-Instruct](https://huggingface.co/nvidia/AceMath-72B-Instruct)
### AceMath Reward Models
- [AceMath-7B-RM](https://huggingface.co/nvidia/AceMath-7B-RM), [AceMath-72B-RM](https://huggingface.co/nvidia/AceMath-72B-RM)
### Evaluation & Training Data
- [AceMath-RewardBench](https://huggingface.co/datasets/nvidia/AceMath-RewardBench), [AceMath-Instruct Training Data](https://huggingface.co/datasets/nvidia/AceMath-Instruct-Training-Data), [AceMath-RM Training Data](https://huggingface.co/datasets/nvidia/AceMath-RM-Training-Data)
### General Instruction Models
- [AceInstruct-1.5B](https://huggingface.co/nvidia/AceInstruct-1.5B), [AceInstruct-7B](https://huggingface.co/nvidia/AceInstruct-7B), [AceInstruct-72B](https://huggingface.co/nvidia/AceInstruct-72B)
## Correspondence to
Zihan Liu (zihanl@nvidia.com), Yang Chen (yachen@nvidia.com), Wei Ping (wping@nvidia.com)
## Citation
If you find our work helpful, we’d appreciate it if you could cite us.
<pre>
@article{acemath2024,
title={AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling},
author={Liu, Zihan and Chen, Yang and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
journal={arXiv preprint},
year={2024}
}
</pre>
## License
All models in the AceMath family are for non-commercial use only, subject to [Terms of Use](https://openai.com/policies/row-terms-of-use/) of the data generated by OpenAI. We put the AceMath models under the license of [Creative Commons Attribution: Non-Commercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0).
Provider: maas
Created: 2025-01-20



