
wyy1112/Plan-RewardBench

Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/wyy1112/Plan-RewardBench
Description:
---
license: cc-by-4.0
task_categories:
- text-generation
- reinforcement-learning
language:
- en
tags:
- reward-model
- agent
- tool-use
- trajectory
- preference
- benchmark
- evaluation
pretty_name: Plan-RewardBench
size_categories:
- 1K<n<10K
---

# 🏆 Plan-RewardBench

**A Comprehensive Benchmark for Trajectory-Level Reward Modeling in Tool-Augmented Agents**

[![arXiv](https://img.shields.io/badge/arXiv-2604.08178-b31b1b.svg)](https://arxiv.org/abs/2604.08178)
[![GitHub](https://img.shields.io/badge/GitHub-Code-181717.svg?logo=github)](https://github.com/wyy-1112/Plan-RewardBench)
[![License](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![ACL 2026](https://img.shields.io/badge/Conference-ACL_2026_Main-purple.svg)]()

> **⚠️ Important**: This is an **evaluation-only** benchmark. The HuggingFace `train` split is simply the default container for the full benchmark data; it does **not** represent a training set. The dataset viewer may be temporarily unavailable; data can still be loaded and downloaded normally.

## Overview

Plan-RewardBench is a trajectory-level preference benchmark with **1,171 pairwise comparisons** across 7 evaluation splits and 4 scenario families, designed to evaluate reward models and LLM judges in complex tool-integrated reasoning scenarios.

## Dataset Structure

| Split | #Pairs | Description |
|---|---|---|
| `planning_single_easy` | 144 | Single-turn planning with straightforward constraints |
| `planning_single_hard` | 158 | Single-turn planning with complex/dynamic constraints |
| `planning_multi_easy` | 109 | Multi-turn planning with moderate horizon |
| `planning_multi_hard` | 73 | Multi-turn planning with long horizon |
| `robust_recovery` | 361 | Recovery from tool errors, partial failures |
| `safety_refusal` | 51 | Safe refusal vs unsafe compliance |
| `tool_irrelevance` | 275 | Recognizing irrelevant/unavailable tools |

## Quick Start

```python
from datasets import load_dataset

# Load the full benchmark
dataset = load_dataset("wyy1112/Plan-RewardBench")

# Filter by scenario family
for item in dataset["train"]:
    if item["_lcp_bucket"] == "planning_multi_hard":
        chosen_msgs = item["chosen"]["messages"]
        reject_msgs = item["reject"]["messages"]
        print(f"UUID: {item['uuid']}, Chosen turns: {len(chosen_msgs)}, Reject turns: {len(reject_msgs)}")
```

Or load directly from JSONL files:

```python
import json

with open("data/planning_multi_easy.jsonl") as f:
    for line in f:
        item = json.loads(line)
        print(item["uuid"], len(item["chosen"]["messages"]), "turns")
```

## Data Format

Each instance contains:

- **`query`**: User's task description
- **`tools`**: Available tool definitions (OpenAI function-calling format)
- **`uuid`**: Unique identifier
- **`chosen`**: Preferred trajectory (`{"messages": [...]}`)
- **`reject`**: Distractor trajectory (`{"messages": [...]}`)
- **`_lcp_bucket`**: Scenario family label (e.g., `planning_multi_easy`, `robust_recovery`)

Messages use roles: `user`, `assistant`, `tool_call`, `tool_response`.

## Citation

```bibtex
@article{wang2026aligning,
  title={Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling},
  author={Wang, Jiaxuan and Hu, Yulan and Yang, Wenjin and Pan, Zheng and Li, Xin and Guo, Lan-Zhe},
  journal={arXiv preprint arXiv:2604.08178},
  year={2026}
}
```

## License

This dataset is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

Provider:
wyy1112