RationaleRM
Source: ModelScope community · Updated 2026-05-09 · Indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/Qwen/RationaleRM
Description:
<div align="center">
<p align="right">
<strong>English</strong> | <a href="./README_zh.md">中文</a>
</p>
<h1>Outcome Accuracy is Not Enough:<br/> Aligning the Reasoning Process of Reward Models</h1>
<p align="center">
<a href="https://arxiv.org/abs/2602.04649"><img src="https://img.shields.io/badge/arXiv-2602.04649-b31b1b.svg" alt="arXiv"></a>
<a href="https://huggingface.co/datasets/Qwen/RationaleRM"><img src="https://img.shields.io/badge/🤗%20Dataset-RationaleRM-yellow" alt="Dataset"></a>
<a href="https://modelscope.cn/datasets/Qwen/RationaleRM"><img src="https://img.shields.io/badge/🤖%20Dataset-RationaleRM-blue" alt="Dataset"></a>
<a href="https://creativecommons.org/licenses/by/4.0/legalcode.en"><img src="https://img.shields.io/badge/License-CC%20BY%204.0-green.svg" alt="License"></a>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2602.04649"><strong>[📄 Paper]</strong></a> •
<a href="#dataset"><strong>[🤗 Dataset]</strong></a> •
<a href="#citation"><strong>[📜 Citation]</strong></a>
</p>
<p align="center">
<img src="images/overall_compare.png" alt="Outcome Accuracy vs Rationale Consistency" width="70%">
</p>
<p align="center"><em>Outcome Accuracy vs Rationale Consistency: Rationale Consistency effectively distinguishes frontier models and detects deceptive alignment</em></p>
</div>
---
## 📖 Overview
**RationaleRM** is a research project that investigates how to align not just the *outcomes* but also the *reasoning processes* of reward models with human judgments. We find that generative reward models (GenRMs) and LLM-as-a-Judge setups exhibit **Deceptive Alignment**: a model may reach the same final verdict as humans through superficial or even incorrect reasoning.
To address this, we propose the **Rationale Consistency** metric, which measures the alignment between the model's reasoning process and human judgment rationales. We also design the **MetaJudge** framework to compute this metric: it decomposes human and model rationales into atomic units, then performs strict one-to-one semantic matching to precisely quantify their consistency.
**Core Contributions:**
- 🔍 **MetaJudge Framework**: Decomposes human rationales into atomic units and uses LLMs for strict one-to-one semantic matching
- 📊 **Rationale Consistency Metric**: Effectively detects deceptive alignment and distinguishes frontier models (e.g., GPT-5 or Gemini 3 Pro)
- 🛠️ **Hybrid Reward Training**: Combines rationale reward (Average Precision) and outcome reward to prevent "rationale degeneration"
- 🏆 **SOTA Performance**: Achieves best results on RM-Bench (87.1%) and JudgeBench (82.0%)
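
The strict one-to-one matching idea can be sketched as follows. This is a minimal illustration, not the MetaJudge implementation: the greedy assignment strategy, the function names, and the word-overlap `toy_is_match` (a stand-in for the LLM judge) are all assumptions made for the example.

```python
from typing import Callable, List, Optional


def one_to_one_match(
    human: List[str],
    model: List[str],
    is_match: Callable[[str, str], bool],
) -> List[Optional[int]]:
    """For each model rationale, return the index of the human rationale it
    matches, or None. A human rationale can be claimed at most once."""
    claimed = set()
    assignment: List[Optional[int]] = []
    for m in model:
        hit = None
        for i, h in enumerate(human):
            if i not in claimed and is_match(h, m):
                hit = i
                claimed.add(i)
                break
        assignment.append(hit)
    return assignment


def toy_is_match(h: str, m: str) -> bool:
    # Hypothetical stand-in for the LLM judge: word-set Jaccard overlap >= 0.5.
    a, b = set(h.lower().split()), set(m.lower().split())
    return len(a & b) / len(a | b) >= 0.5


human = ["cites wrong year", "ignores the user's format request"]
model = ["wrong year cited", "nice emojis", "year is wrong"]
assignment = one_to_one_match(human, model, toy_is_match)
print(assignment)
```

In this toy run the third model rationale repeats a point that was already matched, so it receives no credit. That is the effect of the one-to-one constraint: restating one correct observation several times cannot inflate the score.
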
---
## 🚨 Key Finding: The Deceptive Alignment Trap
We evaluated 19 frontier models and found two critical flaws when relying solely on outcome accuracy:
### Outcome Accuracy Cannot Distinguish Frontier Models
In the green region of the figure above, multiple models achieve similar outcome accuracy, yet rationale consistency clearly separates stronger models (such as GPT-5, o3, Gemini 3 Pro) from weaker ones (such as Claude 3.5, GPT-4.1).
### Outcome Accuracy Cannot Detect Deceptive Alignment
The most typical example is the comparison between **o3 and o3-mini**: both have similar outcome accuracy, but o3-mini's rationale consistency is nearly 50% lower. o3-mini relies on surface cues (such as formatting, emojis) to make judgments, while o3 performs rigorous fact-checking like humans do.
> 💡 **Key Insight**: Models can make correct choices for wrong reasons. Outcome accuracy alone cannot detect this deceptive alignment.
---
## 📉 Training Finding: Outcome-Only Supervision Leads to Rationale Degeneration
<p align="center">
<img src="images/reward_compare.png" alt="Training Dynamics" width="70%">
</p>
<p align="center"><em>Training dynamics comparison: Similar outcome rewards, but significantly different rationale rewards</em></p>
The figure above shows a key finding during training: **outcome-only supervision leads to a continuous decline in the consistency between model and human reasoning processes**.
- **Left**: Both methods achieve nearly identical outcome rewards, indicating models can learn to select correct answers
- **Right**: Rationale rewards diverge significantly. Without a rationale consistency constraint, the model's rationale reward declines continuously and ends up **24.2%** lower than with our method
This reveals the **Rationale Degeneration** phenomenon: when intermediate reasoning processes are not incentivized, models abandon high-cost evidence verification and instead rely on cheaper surface cues to achieve similar outcome rewards.
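
To make the incentive argument concrete, here is a small sketch of a hybrid reward. This README does not specify how the outcome and rationale rewards are combined, so the weighted sum and the `alpha` parameter below are purely hypothetical; only the idea of adding an Average Precision term to the outcome reward comes from the text.

```python
def outcome_only_reward(outcome_correct: bool) -> float:
    """Reward that looks only at the final preference label."""
    return 1.0 if outcome_correct else 0.0


def hybrid_reward(outcome_correct: bool, rationale_ap: float, alpha: float = 0.5) -> float:
    """Hypothetical combination: a weighted sum of the outcome reward and
    the rationale reward (Average Precision in [0, 1])."""
    return alpha * outcome_only_reward(outcome_correct) + (1.0 - alpha) * rationale_ap


# Two rollouts that pick the right answer, one with well-grounded rationales
# and one relying on surface cues (low AP against the human checklist).
grounded = hybrid_reward(True, rationale_ap=0.4)
shallow = hybrid_reward(True, rationale_ap=0.1)
print(outcome_only_reward(True), outcome_only_reward(True))  # identical
print(grounded, shallow)                                     # differ
```

Under outcome-only supervision both rollouts earn the same reward, so nothing discourages the cheaper strategy. Any combination that includes the rationale term separates them.
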
---
## 🏆 Main Results
We evaluate on two challenging benchmarks:
- **RM-Bench**: Evaluates model ability to distinguish subtle differences and style biases
- **JudgeBench**: Emphasizes deep judgment and logical reasoning
| Model | RM-Bench | JudgeBench | Avg |
| :------------------------------------- | :------------: | :------------: | :-----------: |
| **Generative Reward Models** | | | |
| RM-R1-Distilled-Qwen-32B | 83.9 | 78.8 | 81.4 |
| RRM-32B | 73.1 | 75.7 | 74.4 |
| Nemotron-Super-49B | 82.7 | 77.2 | 80.0 |
| RewardAnything-8B-v1 | 83.1 | 62.6 | 72.9 |
| GRAM-R² | 85.7 | 81.0 | 83.4 |
| **Outcome-Only Baselines** | | | |
| Qwen3-14B (Outcome-Only) | 83.6 | 70.0 | 76.8 |
| Qwen3-30B-A3B (Outcome-Only) | 84.9 | 75.7 | 80.3 |
| **Our Method (Outcome + Rationale)** | | | |
| Qwen3-14B (Ours) | 86.7 | 79.1 | 82.9 |
| **Qwen3-30B-A3B (Ours)** | **87.1** | **82.0** | **84.6** |
> 💡 Our method effectively reverses the rationale consistency decline observed during outcome-only training (from 25% to 37%).
---
## 🚀 Quick Start
### Project Structure
```
RationaleRM/
├── metajudge_infer.py # Semantic matching inference script
├── metajudge_infer.sh # Shell script for running inference
├── metajudge_analysis.py # Analysis script for computing metrics
├── images/ # Images
│ ├── overall_compare.png
│ └── reward_compare.png
├── data/ # Datasets
│ ├── helpsteer3_test_1000.jsonl # Test set: 1000 samples
│ └── helpsteer3_human_checklist.jsonl # Full dataset (22,116 samples)
└── example/ # Example data for testing
    ├── infer_input_10samples.jsonl
    ├── model-low_deceptive_alignment.jsonl
    └── model-high_deceptive_alignment.jsonl
```
### Step 1: Prepare Data
Input data should be in JSONL format with the following fields:
- `human-checklist`: List of human atomic rationales (reference)
- `{model}-checklist`: List of model-generated atomic rationales to be evaluated
Example:
```json
{
  "domain": "general",
  "context": [...],
  "response1": "...",
  "response2": "...",
  "human-checklist": [
    "Response 1 lacks polysyllabic rhymes",
    "Response 2's meter is inconsistent"
  ],
  "model-low_deceptive_alignment-checklist": [
    "Response A's rhyme scheme is forced",
    "Response B's rhythm feels awkward"
  ]
}
```
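
Before spending API calls, it can help to check that each record carries the two checklist fields described above. The sketch below is an assumption-laden helper, not part of the repository: `validate_record` and its rules (each checklist must be a list of non-empty strings) are hypothetical.

```python
import json


def validate_record(record: dict, model_name: str) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for key in ("human-checklist", f"{model_name}-checklist"):
        value = record.get(key)
        if not isinstance(value, list):
            problems.append(f"{key}: missing or not a list")
        elif not all(isinstance(item, str) and item.strip() for item in value):
            problems.append(f"{key}: contains empty or non-string items")
    return problems


line = json.dumps({
    "human-checklist": ["Response 1 lacks polysyllabic rhymes"],
    "model-low_deceptive_alignment-checklist": ["Response A's rhyme scheme is forced"],
})
ok = validate_record(json.loads(line), "model-low_deceptive_alignment")
bad = validate_record(json.loads(line), "model-high_deceptive_alignment")
print(ok, bad)
```

The same function can be applied line by line to a JSONL file, using the value passed to `--model-be-evaluated` as `model_name`.
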
### Step 2: Run Inference
The inference script evaluates how well each model-generated checklist item matches the human checklist:
```bash
# Set environment variables
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1" # Optional, defaults to OpenAI
# Run inference
python metajudge_infer.py \
    --input-file data/helpsteer3_test_1000.jsonl \
    --output-file output/results.jsonl \
    --model gpt-4o \
    --model-be-evaluated model-low_deceptive_alignment \
    --concurrent-requests 5
```
Or use the shell script:
```bash
bash metajudge_infer.sh
```
Key parameters:
- `--input-file`: Path to input JSONL file
- `--output-file`: Path for output results
- `--model`: LLM model for semantic matching (e.g., gpt-4o, qwen-plus)
- `--model-be-evaluated`: The critic model whose checklist will be evaluated
- `--concurrent-requests`: Number of parallel API requests
API configuration (via environment variables or command line):
- `OPENAI_API_KEY` or `--api-key`: API key for the LLM service
- `OPENAI_BASE_URL` or `--api-base`: API base URL (default: https://api.openai.com/v1)
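
One way such a configuration could be resolved is sketched below. The precedence (command-line flag first, then environment variable, then the default URL) is an assumption, since it is not stated here, and `resolve_api_config` is a hypothetical helper rather than a function from `metajudge_infer.py`.

```python
import argparse
import os

DEFAULT_API_BASE = "https://api.openai.com/v1"


def resolve_api_config(argv=None, env=None):
    """Return (api_key, api_base); flags take precedence over environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key", default=None)
    parser.add_argument("--api-base", default=None)
    # parse_known_args ignores the script's other flags in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    api_key = args.api_key or env.get("OPENAI_API_KEY")
    api_base = args.api_base or env.get("OPENAI_BASE_URL") or DEFAULT_API_BASE
    if not api_key:
        raise SystemExit("No API key: set OPENAI_API_KEY or pass --api-key")
    return api_key, api_base


from_env = resolve_api_config([], {"OPENAI_API_KEY": "env-key"})
from_cli = resolve_api_config(
    ["--api-key", "cli-key", "--api-base", "http://localhost:8000/v1"],
    {"OPENAI_API_KEY": "env-key"},
)
print(from_env, from_cli)
```
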
### Step 3: Analyze Results
Compute Precision, Recall, F1, and Average Precision:
```bash
# Analyze single file
python metajudge_analysis.py \
    --input-file example/low_deceptive_alignment_infer_output.jsonl \
    --model-be-evaluated model-low_deceptive_alignment

# Analyze all files in a directory
python metajudge_analysis.py \
    --input-dir example/ \
    --sort-by recall
```
Output example:
```text
====================================================================================================
Results Sorted by RECALL
====================================================================================================
Model                              Precision    Recall        F1        AP     Valid
----------------------------------------------------------------------------------------------------
model-low_deceptive_alignment         0.3300    0.4297    0.3684    0.3991        10
model-high_deceptive_alignment        0.1850    0.2242    0.1985    0.2376        10
====================================================================================================
```
---
## 📊 Metrics
MetaJudge computes the following metrics:
| Metric | Description |
|--------|-------------|
| **Recall** | Proportion of human rationales matched by model rationales |
| **Precision** | Proportion of model rationales that match human rationales (for evaluation) |
| **F1** | Harmonic mean of Precision and Recall |
| **Average Precision (AP)** | Used as the rationale reward during training in the paper |
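
The metrics in the table can be sketched for a single sample as follows. The input representation (`hits`, one flag per model rationale in generation order, after one-to-one matching) and the AP normalization (the information-retrieval convention of dividing by the number of human rationales) are assumptions; the paper's exact AP definition may differ. Note also that in the sample output above the F1 of 0.3684 is not the harmonic mean of the reported Precision and Recall (about 0.3733), which would be consistent with metrics being averaged per sample, though that is an inference rather than something this README states.

```python
from typing import Dict, List


def rationale_metrics(hits: List[bool], n_human: int) -> Dict[str, float]:
    """Per-sample metrics. hits[k] is True if the k-th model rationale was
    matched to a (distinct) human rationale; n_human is the human checklist size."""
    n_matched = sum(hits)
    precision = n_matched / len(hits) if hits else 0.0
    recall = n_matched / n_human if n_human else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Average Precision: mean of precision@k over the positions k that hit,
    # normalized by the number of human rationales (assumed convention).
    running, ap_sum = 0, 0.0
    for k, hit in enumerate(hits, start=1):
        if hit:
            running += 1
            ap_sum += running / k
    ap = ap_sum / n_human if n_human else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "ap": ap}


m = rationale_metrics([True, False, True], n_human=4)
print({name: round(value, 4) for name, value in m.items()})
```

Unlike Precision and Recall, AP depends on the order of the model's rationales: a matched rationale that appears early contributes more than the same rationale listed after several unmatched ones.
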
---
<a id="dataset"></a>
## 📂 Dataset
We provide two datasets:
### 1. HelpSteer3 Human Checklist (Full Dataset)
**`helpsteer3_human_checklist.jsonl`** contains the complete HelpSteer3 dataset with human-annotated atomic rationales, suitable for training.
### 2. Test Set (with Model Checklists)
**`helpsteer3_test_1000.jsonl`** contains the 1,000 test samples used for evaluation in the paper. It includes checklists from two models that represent different levels of deceptive alignment:
| Field | Description |
|-------|-------------|
| `human-checklist` | Human-annotated atomic rationales (reference) |
| `model-low_deceptive_alignment-checklist` | Low deceptive alignment model checklist (corresponds to high Rationale Consistency in the paper) |
| `model-low_deceptive_alignment-label` | Low deceptive alignment model preference label |
| `model-low_deceptive_alignment-generated_text` | Low deceptive alignment model full generated text |
| `model-high_deceptive_alignment-checklist` | High deceptive alignment model checklist (corresponds to low Rationale Consistency in the paper) |
| `model-high_deceptive_alignment-label` | High deceptive alignment model preference label |
| `model-high_deceptive_alignment-generated_text` | High deceptive alignment model full generated text |
> **Note:**
> - Atomic rationales were generated using GPT-5 for research purposes only.
> - The `model-high_deceptive_alignment` and `model-low_deceptive_alignment` data are provided for testing/evaluation purposes only and were not used for training.
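
As a quick way to inspect the fields listed above, the following sketch reads a JSONL file and reports the average checklist length per source. The helper names are hypothetical, and the only thing assumed about the data is that the `*-checklist` fields are lists, as in the Step 1 example.

```python
import json
from statistics import mean

MODELS = ("model-low_deceptive_alignment", "model-high_deceptive_alignment")


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def checklist_sizes(records, models=MODELS):
    """Average number of atomic rationales per sample, for humans and each model.
    A missing checklist counts as length 0; records must be non-empty."""
    records = list(records)
    keys = ["human-checklist"] + [f"{m}-checklist" for m in models]
    return {k: mean(len(r.get(k) or []) for r in records) for k in keys}


sample = [
    {"human-checklist": ["a", "b"],
     "model-low_deceptive_alignment-checklist": ["x", "y"],
     "model-high_deceptive_alignment-checklist": ["z"]},
    {"human-checklist": ["a", "b", "c", "d"],
     "model-low_deceptive_alignment-checklist": ["x", "y"]},
]
sizes = checklist_sizes(sample)
print(sizes)
```

On the real data this would be called as `checklist_sizes(read_jsonl("data/helpsteer3_test_1000.jsonl"))`.
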
---
<a id="citation"></a>
## 📜 Citation
If you find this work helpful, please cite our paper:
```bibtex
@article{wang2026outcome,
title={Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models},
author={Wang, Binghai and Liu, Yantao and Liu, Yuxuan and Tang, Tianyi and Wang, Shenzhi and Gao, Chang and Zheng, Chujie and Zhang, Yichang and Yu, Le and Liu, Shixuan and Gui, Tao and Zhang, Qi and Huang, Xuanjing and Yu, Bowen and Huang, Fei and Lin, Junyang},
journal={arXiv preprint arXiv:2602.04649},
year={2026}
}
```
---
<div align="center">
**Developed by Qwen Team in collaboration with Fudan University**
</div>
Provider:
maas
Created:
2026-02-05



