Agent-RewardBench

Name: Agent-RewardBench
Creator: Cognition and Decision Intelligence Laboratory, Institute of Automation, Chinese Academy of Sciences
Published: 2025-06-26 21:36:12
License: No description available

Source: arXiv. Updated: 2025-06-26. Indexed: 2025-11-28.

Download link:

https://hf-mirror.com/datasets/MultimodalAgent/Agent-RewardBench

Overview:

Agent-RewardBench is a benchmark for evaluating the reward modeling capabilities of multimodal large language models (MLLMs) on multimodal agent tasks. It contains 1,136 high-quality samples spanning 3 evaluation dimensions (perception, planning, and safety) and 7 real-world agent scenarios: mobile, web, desktop, autonomous driving, Minecraft, virtual home, and travel planning. The dataset is built with a two-stage filtering process that screens samples with small models and with human annotators to ensure data quality.
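
The two-stage filtering described above can be pictured with a minimal sketch. This is an illustration only, not the authors' pipeline: the page does not give the filtering criteria or the order of the stages, so the function name `two_stage_filter`, the fields `model_score` and `human_ok`, the 0.5 threshold, and the model-first ordering are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, Iterable, List

Sample = Dict[str, Any]


def two_stage_filter(
    samples: Iterable[Sample],
    model_keep: Callable[[Sample], bool],
    human_keep: Callable[[Sample], bool],
) -> List[Sample]:
    """Keep only the samples accepted by both stages.

    Stage 1 (assumed to run first) is a cheap automatic check, e.g. a
    small model's judgment. Stage 2 is a human review, applied only to
    the samples that survived stage 1.
    """
    stage1 = [s for s in samples if model_keep(s)]
    return [s for s in stage1 if human_keep(s)]


# Toy candidates with hypothetical fields; real samples will differ.
candidates = [
    {"id": 1, "dimension": "perception", "model_score": 0.9, "human_ok": True},
    {"id": 2, "dimension": "planning", "model_score": 0.2, "human_ok": True},
    {"id": 3, "dimension": "safety", "model_score": 0.8, "human_ok": False},
]

kept = two_stage_filter(
    candidates,
    model_keep=lambda s: s["model_score"] >= 0.5,  # assumed threshold
    human_keep=lambda s: s["human_ok"],            # assumed annotator verdict
)
print([s["id"] for s in kept])
```

Running the automatic check first keeps the more expensive human review limited to samples that already passed it, which is one common reason to order such stages this way.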

Provided by:

Cognition and Decision Intelligence Laboratory, Institute of Automation, Chinese Academy of Sciences

Created:

2025-06-26
