
AceMath-RewardBench

ModelScope community · Updated 2025-11-12 · Indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/nv-community/AceMath-RewardBench
Description:
[website](https://research.nvidia.com/labs/adlr/acemath/) | [paper](https://arxiv.org/abs/2412.15084)

## AceMath-RewardBench Evaluation Dataset Card

The AceMath-RewardBench evaluation dataset evaluates the capabilities of a math reward model using the best-of-N (N=8) setting on 7 datasets:

- **GSM8K**: 1319 questions
- **Math500**: 500 questions
- **Minerva Math**: 272 questions
- **Gaokao 2023 en**: 385 questions
- **OlympiadBench**: 675 questions
- **College Math**: 2818 questions
- **MMLU STEM**: 3018 questions

Each example in the dataset contains:

- A mathematical question
- 64 solution attempts of varying quality (8 each from Qwen2/2.5-Math-7/72B-Instruct, Llama-3.1-8/70B-Instruct, Mathstral-7B-v0.1, deepseek-math-7b-instruct)
- Ground truth scores for each solution
- Additional metadata like problem difficulty and topic area

The evaluation benchmark focuses on two criteria:

- Diversity: each question is paired with 64 model responses generated by 8 different language models
- Robustness: the evaluation is run with 100 random seeds (each seed randomly samples 8 responses from the 64 candidates), and the average result is reported

## Benchmark Results

| Model | GSM8K | MATH500 | Minerva Math | GaoKao 2023 En | Olympiad Bench | College Math | MMLU STEM | Avg. |
|---------------------------|-------|---------|--------------|----------------|-----------------|--------------|-----------|--------|
| majority@8 | 96.22 | 83.11 | 41.20 | 68.21 | 42.69 | 45.01 | 78.21 | 64.95 |
| Skywork-o1-Open-PRM-Qwen-2.5-7B | 96.92 | 86.64 | 41.00 | 72.34 | 46.50 | 46.30 | 74.01 | 66.24 |
| Qwen2.5-Math-RM-72B | 96.61 | 86.63 | 43.60 | 73.62 | 47.21 | 47.29 | 84.24 | 68.46 |
| AceMath-7B-RM (Ours) | 96.66 | 85.47 | 41.96 | 73.82 | 46.81 | 46.37 | 80.78 | 67.41 |
| AceMath-72B-RM (Ours) | 97.23 | 86.72 | 45.06 | 74.69 | 49.23 | 46.79 | 87.01 | 69.53 |

Reward model evaluation on [AceMath-RewardBench](https://huggingface.co/datasets/nvidia/AceMath-RewardBench). The table reports the average results (rm@8) of reward models on math benchmarks, obtained by randomly sampling 8 responses from 64 candidates with 100 random seeds. Response candidates are generated from a pool of 8 LLMs.

## How to use

```python
from datasets import load_dataset

# Load the dataset from Hugging Face Hub
dataset = load_dataset("nvidia/AceMath-RewardBench")
print(dataset.keys())
# dict_keys(['gsm8k', 'math500', 'minerva_math', 'gaokao2023en', 'olympiadbench', 'college_math', 'mmlu_stem'])

# Print the field names of the first example
print(dataset['gsm8k'][0].keys())
# dict_keys(['pred', 'question', 'score', 'report', 'idx', 'code', 'gt_cot', 'gt'])

# "question": The text of the mathematical problem
# "code": A list of complete model responses/solutions
# "gt": The ground truth answer
# "pred": A list of extracted predictions from each model response in "code"
# "score": A list of boolean values indicating whether each response matches the ground truth
```

## How to run evaluation

- Requirement: vllm==0.6.6.post1 (for reward model batch inference)
- We provide the inference code (`inference_benchmark.py`) and evaluation script (`evaluate_orm.py`) in `scripts/`:

```bash
bash scripts/example_eval.sh
```

- Full prediction results are in `scripts/orm_eval`

## All Resources

### AceMath Instruction Models

- [AceMath-1.5B-Instruct](https://huggingface.co/nvidia/AceMath-1.5B-Instruct), [AceMath-7B-Instruct](https://huggingface.co/nvidia/AceMath-7B-Instruct), [AceMath-72B-Instruct](https://huggingface.co/nvidia/AceMath-72B-Instruct)

### AceMath Reward Models

- [AceMath-7B-RM](https://huggingface.co/nvidia/AceMath-7B-RM), [AceMath-72B-RM](https://huggingface.co/nvidia/AceMath-72B-RM)

### Evaluation & Training Data

- [AceMath-RewardBench](https://huggingface.co/datasets/nvidia/AceMath-RewardBench), [AceMath-Instruct Training Data](https://huggingface.co/datasets/nvidia/AceMath-Instruct-Training-Data), [AceMath-RM Training Data](https://huggingface.co/datasets/nvidia/AceMath-RM-Training-Data)

### General Instruction Models

- [AceInstruct-1.5B](https://huggingface.co/nvidia/AceInstruct-1.5B), [AceInstruct-7B](https://huggingface.co/nvidia/AceInstruct-7B), [AceInstruct-72B](https://huggingface.co/nvidia/AceInstruct-72B)

## Correspondence to

Zihan Liu (zihanl@nvidia.com), Yang Chen (yachen@nvidia.com), Wei Ping (wping@nvidia.com)

## Citation

If you find our work helpful, we’d appreciate it if you could cite us.

<pre>
@article{acemath2024,
  title={AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling},
  author={Liu, Zihan and Chen, Yang and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}
}
</pre>

## License

All models in the AceMath family are for non-commercial use only, subject to the [Terms of Use](https://openai.com/policies/row-terms-of-use/) of the data generated by OpenAI. We put the AceMath models under the license of [Creative Commons Attribution: Non-Commercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0).
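
## Illustrative sketch: rm@8 and majority@8

The best-of-N protocol described in this card (sample 8 of the 64 candidates, repeat over 100 random seeds, average) can be sketched for a single question as follows. This is a minimal illustration and not the logic of `evaluate_orm.py`, which is not reproduced here. The function names `rm_at_n` and `majority_at_n`, the `rewards` input (one reward-model score per candidate, which is not a field of the dataset), the tie-breaking behaviour, and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter


def rm_at_n(scores, rewards, n=8, num_seeds=100):
    """Best-of-n accuracy for one question, averaged over random seeds.

    scores:  list of bools, whether each candidate matches the ground truth
             (same meaning as the dataset's "score" field).
    rewards: list of floats, one reward-model score per candidate
             (hypothetical input; produced by your own inference run).
    """
    assert len(scores) == len(rewards)
    hits = 0
    for seed in range(num_seeds):
        rng = random.Random(seed)
        subset = rng.sample(range(len(scores)), n)      # n candidates, no replacement
        best = max(subset, key=lambda i: rewards[i])    # highest-reward candidate
        hits += bool(scores[best])
    return hits / num_seeds


def majority_at_n(preds, scores, n=8, num_seeds=100):
    """Majority-vote accuracy for one question, averaged over random seeds.

    Assumes candidates with the same extracted prediction share the same
    correctness; ties go to the answer that Counter lists first.
    """
    assert len(preds) == len(scores)
    hits = 0
    for seed in range(num_seeds):
        rng = random.Random(seed)
        subset = rng.sample(range(len(preds)), n)
        top_pred, _ = Counter(preds[i] for i in subset).most_common(1)[0]
        winner = next(i for i in subset if preds[i] == top_pred)
        hits += bool(scores[winner])
    return hits / num_seeds


# Toy data (not taken from the benchmark): 61 correct and 3 wrong candidates.
toy_preds = ["42"] * 61 + ["7"] * 3
toy_scores = [True] * 61 + [False] * 3
toy_rewards = [1.0 if ok else 0.0 for ok in toy_scores]  # an idealised reward model

print("rm@8:", rm_at_n(toy_scores, toy_rewards))
print("majority@8:", majority_at_n(toy_preds, toy_scores))
```

A benchmark-level number would then be the mean of these per-question values over all questions in a subset, expressed as a percentage.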

Provider:
maas
Created:
2025-01-20