ui-vision
ModelScope community. Updated 2025-12-05; indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/ServiceNow/ui-vision
# UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
<div style="display: flex; gap: 10px;">
<a href="https://github.com/uivision/UI-Vision">
<img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white" alt="github" class="img-fluid" />
</a>
<a href="https://arxiv.org/abs/2503.15661">
<img src="https://img.shields.io/badge/arXiv-paper-b31b1b.svg?style=for-the-badge" alt="paper" class="img-fluid" />
</a>
<a href="https://uivision.github.io/">
<img src="https://img.shields.io/badge/website-%23b31b1b.svg?style=for-the-badge&logo=globe&logoColor=white" alt="website" class="img-fluid" />
</a>
</div>
## Introduction
Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.
<div align="center">
<img src="assets/data_pipeline.png" alt="Dataset Overview" width="1000" height="450" class="img-fluid" />
</div>
## Data Structure
To get started with UI-Vision:
1. Clone the repository to get the images and annotations:
```bash
git clone https://huggingface.co/datasets/ServiceNow/ui-vision
```
2. The repository is organized as follows:
```
uivision/
├── annotations/ # Dataset annotations
│ ├── element_grounding/
│ │ ├── element_grounding_basic.json
│ │ ├── element_grounding_functional.json
│ │ └── element_grounding_spatial.json
│ └── layout_grounding/
│ └── layout_grounding.json
├── images/ # Dataset images
│ ├── element_grounding/
│ └── layout_grounding/
├── assets/ # HuggingFace README assets
└── README.md
```
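As a minimal sketch, the annotation files listed in the tree above could be loaded in Python as follows. The helper `load_annotations` and the `ANNOTATION_FILES` mapping are illustrative names, not part of the UI-Vision code. The relative paths are copied from the tree; the JSON schema is not described in this README, so the parsed object is returned unchanged.

```python
import json
from pathlib import Path

# Relative paths copied from the repository tree in this README.
# The dictionary keys are hypothetical short names chosen for this sketch.
ANNOTATION_FILES = {
    "element_grounding_basic": "annotations/element_grounding/element_grounding_basic.json",
    "element_grounding_functional": "annotations/element_grounding/element_grounding_functional.json",
    "element_grounding_spatial": "annotations/element_grounding/element_grounding_spatial.json",
    "layout_grounding": "annotations/layout_grounding/layout_grounding.json",
}


def load_annotations(root, name):
    """Parse one annotation file under the local clone at `root`.

    No assumption is made about the JSON structure: whatever
    json.load produces (list or dict) is returned as-is.
    """
    path = Path(root) / ANNOTATION_FILES[name]
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Pass the local folder created by the clone as `root`, for example `load_annotations("ui-vision", "layout_grounding")`; the folder name depends on how the repository was cloned.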
## Usage
To run the models:
1. Visit our [GitHub repository](https://github.com/uivision/UI-Vision) for the latest code
2. Make sure to specify the correct paths to:
- Annotation files
- Task image folders
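Before launching an evaluation, it can help to confirm that both paths exist. The sketch below is a standalone check, not part of the UI-Vision code; `missing_paths` and its parameters are hypothetical names, and the actual evaluation scripts in the GitHub repository may expect these paths in a different form.

```python
from pathlib import Path


def missing_paths(annotation_file, image_dir):
    """Return a list of problems; an empty list means both paths exist."""
    problems = []
    # The annotation path should point at a single JSON file.
    if not Path(annotation_file).is_file():
        problems.append(f"annotation file not found: {annotation_file}")
    # The image path should point at a task image folder.
    if not Path(image_dir).is_dir():
        problems.append(f"image folder not found: {image_dir}")
    return problems
```

For example, calling `missing_paths` with an annotation file under `annotations/element_grounding/` and the matching folder under `images/element_grounding/` returns an empty list when the clone is complete.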
## Citation
If you find this work useful in your research, please consider citing:
```bibtex
@misc{nayak2025uivisiondesktopcentricguibenchmark,
title={UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction},
author={Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Juan A. Rodriguez and
Montek Kalsi and Rabiul Awal and Nicolas Chapados and M. Tamer Özsu and
Aishwarya Agrawal and David Vazquez and Christopher Pal and Perouz Taslakian and
Spandana Gella and Sai Rajeswar},
year={2025},
eprint={2503.15661},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.15661},
}
```
## License
This project is licensed under the MIT License.
## Intended Usage
This dataset is intended to be used by the community to evaluate and analyze their models. We are continuously striving to improve the dataset. If you have any suggestions or problems regarding the dataset, please contact the authors. We also welcome OPT-OUT requests if users want their data removed. To do so, they can either submit a PR or contact the authors directly.
Provider: maas
Created: 2025-11-13



