MageBench
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/microsoft/MageBench
Resource description:
MageBench is a reasoning-focused multimodal agent benchmark. It covers three types of environments (WebUI, Sokoban, and Soccer) with a total of 483 distinct scenarios, designed to test agents' knowledge, visual intelligence, and interaction skills. The benchmark emphasizes chained visual reasoning, and its evaluation metrics require multiple repeated runs to minimize random fluctuations. Its goal is to evaluate the reasoning abilities of large multimodal models in an agent setting.
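The description notes that the evaluation metrics need multiple repetitions to minimize random fluctuations. As a rough illustration of that idea only, the sketch below averages a stochastic score over repeated runs and reports the standard error of the mean. It is not MageBench's actual evaluation code: `repeated_score` and `toy_episode` are hypothetical names, and the toy episode is an assumed stand-in for a real agent rollout.

```python
import random
import statistics
from typing import Callable, Tuple


def repeated_score(
    run_episode: Callable[[random.Random], float],
    repeats: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Run a stochastic evaluation `repeats` times.

    Returns (mean score, standard error of the mean).
    """
    if repeats < 2:
        raise ValueError("need at least two repetitions to estimate spread")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    scores = [run_episode(rng) for _ in range(repeats)]
    mean = statistics.mean(scores)
    # The standard error shrinks roughly with the square root of `repeats`.
    stderr = statistics.stdev(scores) / repeats ** 0.5
    return mean, stderr


def toy_episode(rng: random.Random) -> float:
    # Hypothetical stand-in for one agent rollout: a noisy score around 0.5.
    return 0.5 + rng.uniform(-0.2, 0.2)


mean_few, err_few = repeated_score(toy_episode, repeats=5)
mean_many, err_many = repeated_score(toy_episode, repeats=500)
print(f"5 runs:   {mean_few:.3f} +/- {err_few:.3f}")
print(f"500 runs: {mean_many:.3f} +/- {err_many:.3f}")
```

With more repetitions the reported mean becomes less sensitive to the randomness of any single run, which is the motivation for repeated evaluation.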
Provided by:
Microsoft



