wyy1112/Plan-RewardBench
Source: Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/wyy1112/Plan-RewardBench
---
license: cc-by-4.0
task_categories:
- text-generation
- reinforcement-learning
language:
- en
tags:
- reward-model
- agent
- tool-use
- trajectory
- preference
- benchmark
- evaluation
pretty_name: Plan-RewardBench
size_categories:
- 1K<n<10K
---
# 🏆 Plan-RewardBench
**A Comprehensive Benchmark for Trajectory-Level Reward Modeling in Tool-Augmented Agents**
[arXiv](https://arxiv.org/abs/2604.08178)
[GitHub](https://github.com/wyy-1112/Plan-RewardBench)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
> **⚠️ Important**: This is an **evaluation-only** benchmark. The Hugging Face `train` split is simply the default container for the full benchmark data; it is **not** a training set. The dataset viewer may be temporarily unavailable, but the data can still be loaded and downloaded normally.
## Overview
Plan-RewardBench is a trajectory-level preference benchmark with **1,171 pairwise comparisons** across 7 evaluation splits and 4 scenario families, designed to evaluate reward models and LLM judges in complex tool-integrated reasoning scenarios.
## Dataset Structure
| Split | #Pairs | Description |
|---|---|---|
| `planning_single_easy` | 144 | Single-turn planning with straightforward constraints |
| `planning_single_hard` | 158 | Single-turn planning with complex/dynamic constraints |
| `planning_multi_easy` | 109 | Multi-turn planning with moderate horizon |
| `planning_multi_hard` | 73 | Multi-turn planning with long horizon |
| `robust_recovery` | 361 | Recovery from tool errors, partial failures |
| `safety_refusal` | 51 | Safe refusal vs unsafe compliance |
| `tool_irrelevance` | 275 | Recognizing irrelevant/unavailable tools |
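The counts in the table can double as a sanity check after loading. The sketch below is illustrative only: it assumes the `_lcp_bucket` field of each instance carries exactly the split names listed above, and the helper names `count_pairs_by_split` and `check_counts` are hypothetical, not part of the benchmark's tooling.

```python
from collections import Counter

# Pair counts copied from the Dataset Structure table.
EXPECTED_PAIRS = {
    "planning_single_easy": 144,
    "planning_single_hard": 158,
    "planning_multi_easy": 109,
    "planning_multi_hard": 73,
    "robust_recovery": 361,
    "safety_refusal": 51,
    "tool_irrelevance": 275,
}


def count_pairs_by_split(items, key="_lcp_bucket"):
    """Count how many preference pairs carry each split label."""
    return Counter(item[key] for item in items)


def check_counts(items):
    """Return {split: (expected, observed)} for every split whose count differs."""
    observed = count_pairs_by_split(items)
    return {
        split: (expected, observed.get(split, 0))
        for split, expected in EXPECTED_PAIRS.items()
        if observed.get(split, 0) != expected
    }


# The per-split counts add up to the 1,171 pairs stated in the overview.
assert sum(EXPECTED_PAIRS.values()) == 1171
```

Calling `check_counts(dataset["train"])` on the loaded benchmark should return an empty dict if the assumption about the labels holds; a non-empty result points to a label or version mismatch rather than necessarily to corrupted data.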
## Quick Start
```python
from datasets import load_dataset

# Load the full benchmark
dataset = load_dataset("wyy1112/Plan-RewardBench")

# Filter by scenario family
for item in dataset["train"]:
    if item["_lcp_bucket"] == "planning_multi_hard":
        chosen_msgs = item["chosen"]["messages"]
        reject_msgs = item["reject"]["messages"]
        print(f"UUID: {item['uuid']}, Chosen turns: {len(chosen_msgs)}, Reject turns: {len(reject_msgs)}")
```
Or load directly from JSONL files:
```python
import json

with open("data/planning_multi_easy.jsonl") as f:
    for line in f:
        item = json.loads(line)
        print(item["uuid"], len(item["chosen"]["messages"]), "turns")
```
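To read several splits at once, the same pattern can be wrapped in a small helper. This is a sketch under an assumption: only `data/planning_multi_easy.jsonl` is shown above, so the `data/<split>.jsonl` layout for the other splits is a guess, and `load_jsonl_splits` is a hypothetical name. Files that do not exist are skipped rather than treated as errors.

```python
import json
from pathlib import Path


def load_jsonl_splits(data_dir, splits):
    """Read <data_dir>/<split>.jsonl for each split name.

    Assumption: every split is stored as one JSONL file named after the split.
    Missing files are skipped, so the result may hold fewer keys than `splits`.
    """
    loaded = {}
    for split in splits:
        path = Path(data_dir) / f"{split}.jsonl"
        if not path.exists():
            continue
        with path.open(encoding="utf-8") as f:
            # One JSON object per non-empty line.
            loaded[split] = [json.loads(line) for line in f if line.strip()]
    return loaded
```

For example, `load_jsonl_splits("data", ["planning_multi_easy", "planning_multi_hard"])` returns a dict mapping each split found on disk to its list of instances.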
## Data Format
Each instance contains:
- **`query`**: User's task description
- **`tools`**: Available tool definitions (OpenAI function-calling format)
- **`uuid`**: Unique identifier
- **`chosen`**: Preferred trajectory (`{"messages": [...]}`)
- **`reject`**: Distractor trajectory (`{"messages": [...]}`)
- **`_lcp_bucket`**: Scenario family label (e.g., `planning_multi_easy`, `robust_recovery`)
Messages use roles: `user`, `assistant`, `tool_call`, `tool_response`.
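Given this format, one common way to score a reward model or judge on pairwise data is pairwise accuracy: the fraction of pairs where the `chosen` trajectory receives a strictly higher score than the `reject` trajectory. The sketch below is not the paper's evaluation protocol. It assumes each message is a dict with `role` and `content` keys, and `render_trajectory`, `pairwise_accuracy`, and the `score_fn` signature are hypothetical.

```python
from typing import Callable, Iterable


def render_trajectory(trajectory: dict) -> str:
    """Flatten a trajectory into one 'role: content' line per message.

    Assumption: each message is a dict with 'role' and 'content' keys.
    """
    lines = []
    for msg in trajectory["messages"]:
        role = msg.get("role", "unknown")
        content = msg.get("content") or ""
        lines.append(f"{role}: {content}")
    return "\n".join(lines)


def pairwise_accuracy(
    items: Iterable[dict],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where score_fn ranks 'chosen' strictly above 'reject'.

    Ties count as errors. Returns 0.0 for an empty input.
    """
    total = 0
    correct = 0
    for item in items:
        chosen_score = score_fn(item["query"], render_trajectory(item["chosen"]))
        reject_score = score_fn(item["query"], render_trajectory(item["reject"]))
        correct += int(chosen_score > reject_score)
        total += 1
    return correct / total if total else 0.0


def length_scorer(query: str, trajectory_text: str) -> float:
    """Toy placeholder scorer (longer text scores higher); swap in a real model."""
    return float(len(trajectory_text))
```

Grouping instances by `_lcp_bucket` before calling `pairwise_accuracy` gives a per-split breakdown, which is useful because the splits differ widely in size.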
## Citation
```bibtex
@article{wang2026aligning,
title={Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling},
author={Wang, Jiaxuan and Hu, Yulan and Yang, Wenjin and Pan, Zheng and Li, Xin and Guo, Lan-Zhe},
journal={arXiv preprint arXiv:2604.08178},
year={2026}
}
```
## License
This dataset is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Provider: wyy1112



