reward-bench-2

Name: reward-bench-2
Creator: maas
Published: 2026-01-06 16:34:35
License: no description provided

ModelScope community · Updated 2026-01-06 · Indexed 2025-06-07

Download link:

https://modelscope.cn/datasets/allenai/reward-bench-2


Resource description:

[Code](https://github.com/allenai/reward-bench) | [Leaderboard](https://huggingface.co/spaces/allenai/reward-bench) | [Results](https://huggingface.co/datasets/allenai/reward-bench-2-results) | [Paper](https://arxiv.org/abs/2506.01937)

# RewardBench 2 Evaluation Dataset Card

The RewardBench 2 evaluation dataset is the new version of RewardBench. It is based on unseen human data and designed to be substantially more difficult!

RewardBench 2 evaluates capabilities of reward models (RMs) over the following categories:

1. **Factuality** (*NEW!*): Tests the ability of RMs to detect hallucinations and other basic errors in completions.
2. **Precise Instruction Following** (*NEW!*): Tests the ability of RMs to judge whether text follows precise instructions, such as "Answer without the letter u".
3. **Math**: Tests RMs' abilities at math, on open-ended human prompts ranging from middle school physics and geometry to college-level chemistry, calculus, combinatorics, and more.
4. **Safety**: Tests RMs' abilities to correctly comply with or refuse prompts related to harmful use cases as well as general compliance behaviors.
5. **Focus**: Tests RMs' ability to detect high-quality, on-topic answers to general user queries.
6. **Ties** (*NEW!*): This new type of subset tests the robustness of RMs in domains with many possible similar answers. For example, the question "Name a color of the rainbow" has seven possible correct answers and infinitely many incorrect ones.

The RewardBench 2 leaderboard averages over these six subsets.
For the first five categories, RewardBench 2 counts a prompt as a success when the score of the prompt-chosen pair is greater than the score of each of the *three* prompt-rejected pairs.
The "Ties" score is a weighted score of accuracy (as measured by *all* valid correct answers being scored higher than *all* incorrect answers) and whether the reward margin between correct and incorrect answers exceeds that of the highest and lowest-scored correct responses.
This metric rewards not only correctness, but also a model's ability to prioritize correct answers over incorrect ones more strongly than it distinguishes between equally valid correct responses.

<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/reward-bench/main-fig-hor.png" alt="RewardBench 2 Flow" width="800" style="margin-left:auto; margin-right:auto; display:block"/>

## Dataset Construction Summary

| Domain | Count | Prompt Source | Method of generating completions | Completion Filtering |
|--------|-------|---------------|----------------------------------|---------------------|
| Factuality | 475 | Human | Both | Multi-LM-as-a-judge |
| Precise IF | 160 | Human | Natural | Verifier functions |
| Math | 183 | Human | Natural | Majority voting |
| Safety | 450 | CoCoNot | Both | LM-as-a-judge & rubrics |
| Focus | 495 | Human | System Prompt Variation | N/A |
| Ties | 102 | Manual | System Prompt Variation | Manual verification |

## Dataset Details

Each sample in the dataset has the following items. Note that the dataset is single-turn:

* `prompt` (`str`): the instruction given in the various test sets.
* `chosen` (`list[str]`): the chosen response(s) (1 chosen response for all subsets but ties)
* `rejected` (`list[str]`): the rejected responses (3 rejected responses for all subsets but ties)
* `num_correct` (`int`): the number of chosen responses
* `num_rejected` (`int`): the number of rejected responses
* `total_completions` (`int`): the total number of responses
* `models` (`list[str]`): a list of models that the chosen and rejected responses are generated from, respectively
* `subset` (`str`): the subset the datapoint is part of.
* `id` (`int`): an incremented id for every prompt in the benchmark.

To select a specific subset, use the HuggingFace Datasets `.filter` functionality.
```
dataset = dataset.filter(lambda ex: ex["subset"] == "Factuality")
```

## Models Used

We generated completions from the following models:

- [Mistral 7B Instruct v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3) (Apache 2.0)
- [Tulu 3 8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) (Llama 3.1 Community License Agreement)
- [Tulu 3 70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) (Llama 3.1 Community License Agreement)
- [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (Llama 3.1 Community License Agreement)
- [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (Llama 3.1 Community License Agreement)
- [Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) (Llama 3.2 Community License Agreement)
- [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) (Llama 2 Community License Agreement)
- [Tulu 2 70B](https://huggingface.co/allenai/tulu-2-dpo-70b) (Ai2 ImpACT Low Risk License)
- [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) (Qwen License Agreement)
- [Qwen2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) (Apache 2.0)
- [Qwen2.5 14B Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache 2.0)
- [Qwen2.5 0.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) (Apache 2.0)
- [Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) (Qwen License Agreement)
- [Qwen2.5 Math 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) (Apache 2.0)
- [Deepseek Math 7B RL](https://huggingface.co/deepseek-ai/deepseek-math-7b-rl) (This model is licensed under the Deepseek License. Any use of the outputs from this model must be in accordance with the use restrictions in the [Deepseek License](https://github.com/deepseek-ai/DeepSeek-Math/blob/main/LICENSE-MODEL).)
- [OLMoE 1B 7B 0924 Instruct](https://huggingface.co/allenai/OLMoE-1B-7B-0924) (Apache 2.0)
- [Dolphin 2.0 Mistral 7b](https://huggingface.co/cognitivecomputations/dolphin-2.0-mistral-7b) (Apache 2.0)
- [Zephyr 7b Beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) (MIT License)
- GPT-4o (Outputs produced by GPT-4 are subject to OpenAI's [terms of use](https://openai.com/policies/row-terms-of-use/))
- Claude 3.5 Sonnet (Outputs produced by Claude are subject to Anthropic's [terms of service](https://www.anthropic.com/legal/consumer-terms) and [usage policy](https://www.anthropic.com/legal/aup))

## License

This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use). This dataset includes output data generated from third-party models that are subject to separate terms governing their use.

## Trained Reward Models

We also trained and released several reward models. Check out the [RewardBench 2 Collection](https://huggingface.co/collections/allenai/reward-bench-2-683d2612a4b3e38a3e53bb51) to use them!

## Citation

```
@misc{malik2025rewardbench2advancingreward,
  title={RewardBench 2: Advancing Reward Model Evaluation},
  author={Saumya Malik and Valentina Pyatkin and Sander Land and Jacob Morrison and Noah A. Smith and Hannaneh Hajishirzi and Nathan Lambert},
  year={2025},
  eprint={2506.01937},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.01937},
}
```
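
The scoring rule that the card describes for the first five subsets (the chosen completion must outscore all three rejected completions) and the leaderboard average over subsets can be illustrated with a small sketch. This is not the official evaluation code from the reward-bench repository. The function names and the input layout (one chosen score plus a list of rejected scores per prompt) are assumptions made for illustration, and the weighted Ties metric is deliberately left out because the card does not give its exact weights.

```python
from typing import Dict, List, Sequence, Tuple


def prompt_is_correct(chosen_score: float, rejected_scores: Sequence[float]) -> bool:
    """Return True when the chosen completion strictly outscores every rejected one.

    Hypothetical helper: a tie between chosen and rejected counts as a failure,
    following the card's wording "greater than".
    """
    return all(chosen_score > r for r in rejected_scores)


def subset_accuracy(examples: List[Tuple[float, Sequence[float]]]) -> float:
    """Fraction of prompts in one subset where the chosen completion wins.

    `examples` holds (chosen_score, rejected_scores) pairs, an assumed layout.
    """
    wins = [prompt_is_correct(chosen, rejected) for chosen, rejected in examples]
    return sum(wins) / len(wins)


def leaderboard_score(subset_scores: Dict[str, float]) -> float:
    """Unweighted mean over per-subset scores, as the card says the leaderboard does."""
    return sum(subset_scores.values()) / len(subset_scores)


if __name__ == "__main__":
    # Made-up reward scores for two prompts in one subset.
    factuality = [
        (2.0, [1.0, 0.5, -1.0]),  # chosen beats all three rejected: success
        (0.0, [1.0, 0.5, -1.0]),  # chosen loses to two rejected: failure
    ]
    print(subset_accuracy(factuality))
```

Requiring a win over all three rejected completions makes random guessing succeed only about one time in four, rather than one in two as in a pairwise comparison.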


Provider:

maas

Created:

2025-06-03

Collected summary

Dataset introduction