Agent-RewardBench
Source: arXiv. Updated: 2025-06-26. Indexed: 2025-11-28.
Download link:
https://hf-mirror.com/datasets/MultimodalAgent/Agent-RewardBench
Description:
Agent-RewardBench is a benchmark for evaluating the reward modeling capabilities of multimodal large language models (MLLMs) on multimodal agent tasks. It contains 1,136 high-quality samples spanning 3 evaluation dimensions (perception, planning, and safety) and 7 real-world agent scenarios: mobile, web, desktop, autonomous driving, Minecraft, virtual home, and travel planning. To ensure data quality, the dataset was built with a two-stage filtering process that combines filtering by small models with screening by human annotators.
Provided by:
Laboratory of Cognition and Decision Intelligence, Institute of Automation, Chinese Academy of Sciences
Created:
2025-06-26



