MMBench-GUI

Name: MMBench-GUI
Creator: maas
Published: 2026-05-15 09:59:12
License: No description provided

ModelScope community; updated 2026-05-15; indexed 2025-06-28

Download link:

https://modelscope.cn/datasets/OpenGVLab/MMBench-GUI


Description:

# 🖥️ MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

<a href="https://huggingface.co/papers/2507.19478">📖 Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;<a href="https://github.com/open-compass/MMBench-GUI">💻 Code</a>&nbsp;&nbsp; | &nbsp;&nbsp;<a href="https://huggingface.co/datasets/OpenGVLab/MMBench-GUI">🤗 Dataset</a>&nbsp;&nbsp; | &nbsp;&nbsp;📢 <a href="#">Leaderboard (coming soon)</a>

## Paper Abstract

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More importantly, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at this https URL.

## Introduction

We are happy to release MMBench-GUI, a hierarchical, multi-platform benchmark framework and toolbox for evaluating GUI agents. MMBench-GUI comprises four evaluation levels: GUI Content Understanding, GUI Element Grounding, GUI Task Automation, and GUI Task Collaboration. We also propose the Efficiency-Quality Area (EQA) metric for agent navigation, which integrates accuracy and efficiency. MMBench-GUI provides a rigorous standard for evaluating GUI agent capabilities and guiding their future development.

MMBench-GUI is built on [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) and supports evaluating models either through an API or through local deployment. We hope that MMBench-GUI will enable more researchers to evaluate agents more efficiently and comprehensively.

![level1_example](https://github.com/open-compass/MMBench-GUI/blob/main/assets/L1_example.png)
![level2_example](https://github.com/open-compass/MMBench-GUI/blob/main/assets/L2_example.png)
![level3_example](https://github.com/open-compass/MMBench-GUI/blob/main/assets/L3_example.png)
![level4_example](https://github.com/open-compass/MMBench-GUI/blob/main/assets/L4_example.png)

Examples of tasks at each level

### Features

* **Hierarchical evaluation**: We developed a hierarchical evaluation framework to assess GUI agents' capabilities systematically and comprehensively. The framework is organized into four ascending levels, termed L1 through L4.
* **Multi-platform evaluation support**: We established a robust, multi-platform evaluation dataset covering diverse operating systems and interfaces (Windows, macOS, Linux, iOS, Android, and Web), ensuring extensive coverage and relevance to real-world applications.
* **A more human-aligned evaluation metric for planning**: We value both the speed and the quality of the agent. We therefore propose the Efficiency-Quality Area (EQA) metric as a replacement for Success Rate (SR). EQA balances accuracy and efficiency, rewarding agents that achieve task objectives in the fewest operational steps.
* **Manually reviewed and optimized online task setup**: We conducted a thorough review of existing online tasks and excluded those that could not be completed due to issues such as network or account restrictions.
* **More up-to-date evaluation data and more comprehensive task design**: We collected, annotated, and processed additional evaluation data through a semi-automated workflow to better assess the agent's localization and understanding capabilities. Overall, the benchmark comprises over 8,000 tasks spanning various operating platforms.

## Data structure

After downloading this repo, extract the zip file and organize the files in the following structure:

```text
DATA_ROOT/                    // We use LMUData in VLMEvalKit as the default root dir.
|-- MMBench-GUI/
|   |-- offline_images/
|   |   |-- os_windows/
|   |   |   |-- 0b08bd98_a0e7b2a5_68e346390d562be39f55c1aa7db4a5068d16842c0cb29bd1c6e3b49292a242d1.png
|   |   |   |-- ...
|   |   |-- os_mac/
|   |   |-- os_linux/
|   |   |-- os_ios/
|   |   |-- os_android/
|   |   `-- os_web/
|   |-- L1_annotations.json
|   `-- L2_annotations.json
```

## Usage

For detailed instructions on installation, data preparation, evaluation, and integrating your own models, please refer to the [MMBench-GUI GitHub repository](https://github.com/open-compass/MMBench-GUI).

## Citation

If you find our paper and code useful in your research, please consider giving the repository a star :star: and citing our work :pencil: :)

```Bibtex
@article{wang2025mmbenchgui,
  title   = {MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents},
  author  = {Xuehui Wang and Zhenyu Wu and JingJing Xie and Zichen Ding and Bowen Yang and Zehao Li and Zhaoyang Liu and Qingyun Li and Xuan Dong and Zhe Chen and Weiyun Wang and Xiangyu Zhao and Jixuan Chen and Haodong Duan and Tianbao Xie and Shiqian Su and Chenyu Yang and Yue Yu and Yuan Huang and Yiqian Liu and Xiao Zhang and Xiangyu Yue and Weijie Su and Xizhou Zhu and Wei Shen and Jifeng Dai and Wenhai Wang},
  journal = {arXiv preprint arXiv:2507.19478},
  year    = {2025}
}
```
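As a quick sanity check on the layout described under "Data structure", the following minimal sketch lists which of the expected directories and annotation files are missing under a given root. It is not part of the MMBench-GUI toolbox: the function name `find_missing` and the two constant names are hypothetical, and the only paths it checks are the ones named in the directory tree above.

```python
from pathlib import Path

# Directory and file names taken from the layout in "Data structure".
PLATFORM_DIRS = ["os_windows", "os_mac", "os_linux", "os_ios", "os_android", "os_web"]
ANNOTATION_FILES = ["L1_annotations.json", "L2_annotations.json"]


def find_missing(data_root):
    """Return the expected paths (relative to data_root) that do not exist.

    Hypothetical helper for illustration; it only checks for presence and
    does not inspect image or annotation contents.
    """
    root = Path(data_root) / "MMBench-GUI"
    missing = []
    for name in PLATFORM_DIRS:
        # Each platform should have its own folder under offline_images/.
        if not (root / "offline_images" / name).is_dir():
            missing.append(f"MMBench-GUI/offline_images/{name}/")
    for name in ANNOTATION_FILES:
        # Annotation files sit directly under MMBench-GUI/.
        if not (root / name).is_file():
            missing.append(f"MMBench-GUI/{name}")
    return missing
```

Calling `find_missing("/path/to/LMUData")` returns an empty list when every expected entry is present. The path is a placeholder for wherever the archive was extracted.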


Provider:

maas

Created:

2025-06-26
