grounding_dataset
ModelScope Community. Updated 2025-12-05; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/Salesforce/grounding_dataset
# Grounding Dataset
A comprehensive, high-quality dataset for GUI element grounding tasks, curated from multiple authoritative sources to provide diverse, well-annotated interface interactions.
## Overview
This dataset combines and standardizes annotations from five major GUI interaction datasets:
- **[Aria-UI](https://github.com/AriaUI/Aria-UI)**
- **[OmniAct](https://huggingface.co/datasets/Writer/omniact)**
- **[Widget Caption](https://huggingface.co/datasets/rootsautomation/RICO-WidgetCaptioning)**
- **[UI-Vision](https://huggingface.co/datasets/ServiceNow/ui-vision)**
- **[OS-Atlas](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data)**
## Dataset Schema
Each sample contains the following fields:
| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `dataset` | string | Source dataset identifier | "ariaui", "omniact", "widget_caption", "ui_vision", "os_altas" |
| `uuid` | string | Unique sample identifier | "0ce7f27b-0d76-4276-a624-39fc1836b46e" |
| `image` | PIL.Image | Screenshot/interface image | RGB image object |
| `bbox` | list[int] | Bounding box coordinates [x1, y1, x2, y2] | [33, 75, 534, 132] |
| `instruction` | string | Action-focused instruction | "Tap the Search Maps field" |
| `description` | string | Visual element description | "Dark gray, rounded search bar with magnifying glass icon" |
| `function` | string | Functional purpose | "Use this input field to find a specific location" |
| `combine` | string | Comprehensive instruction | "At the top of the left sidebar, tap the dark gray search bar..." |
| `org_caption` | string | Original caption from source | "search maps" |
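
The `bbox` field can be turned into a click target or a resolution-independent box with a few lines of arithmetic. The sketch below is illustrative only: it assumes the coordinates are absolute pixels with the origin at the top-left corner (the card does not state the unit), and the helper names `bbox_center` and `normalize_bbox` are hypothetical, not part of the dataset.

```python
from typing import Sequence, Tuple


def bbox_center(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return the (x, y) midpoint of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def normalize_bbox(
    bbox: Sequence[float], width: int, height: int
) -> Tuple[float, float, float, float]:
    """Scale a box to [0, 1] coordinates (assumes the box is in pixels)."""
    if width <= 0 or height <= 0:
        raise ValueError("image size must be positive")
    x1, y1, x2, y2 = bbox
    if not (x1 <= x2 and y1 <= y2):
        raise ValueError(f"expected x1 <= x2 and y1 <= y2, got {list(bbox)}")
    return (x1 / width, y1 / height, x2 / width, y2 / height)


# Example box from the schema table; the 1000x200 image size is made up.
example_bbox = [33, 75, 534, 132]
center = bbox_center(example_bbox)
normalized = normalize_bbox(example_bbox, width=1000, height=200)
print(center, normalized)
```

With a real sample, the width and height would come from `sample['image'].size`, which PIL returns as `(width, height)`.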
## Dataset Characteristics
### Domain Coverage
- **Desktop Applications**: Native desktop software interfaces
- **Web Interfaces**: Browser-based applications and websites
- **Mobile Interfaces**: Touch-based mobile applications
- **Operating Systems**: System-level interface interactions
## Applications
This dataset supports research and development in:
### Model Training
- **Vision-Language Models**: Training models to understand GUI screenshots
- **Grounding Models**: Learning to locate elements based on natural language
- **Multimodal Understanding**: Combining visual and textual information
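
One common way to score a grounding model is to check whether its predicted click point falls inside the annotated box. The card does not prescribe a metric, so the following is a minimal sketch under that assumption; `point_in_bbox` and `click_accuracy` are hypothetical helper names.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_in_bbox(point: Point, bbox: Sequence[float]) -> bool:
    """True if (x, y) lies inside or on the edge of [x1, y1, x2, y2]."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def click_accuracy(
    predictions: Sequence[Point], bboxes: Sequence[Sequence[float]]
) -> float:
    """Fraction of predicted points that land inside their target box."""
    if len(predictions) != len(bboxes):
        raise ValueError("predictions and bboxes must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_bbox(p, b) for p, b in zip(predictions, bboxes))
    return hits / len(predictions)


# Two made-up predictions scored against the example box from the schema.
target = [33, 75, 534, 132]
score = click_accuracy([(100, 100), (0, 0)], [target, target])
print(score)
```

Treating edge points as hits is a design choice here; a stricter evaluation could use exclusive bounds or an IoU threshold on predicted boxes instead.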
## Usage Examples
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("Salesforce/grounding_dataset")
# Access a sample
sample = dataset['train'][0]
image = sample['image'] # PIL Image
bbox = sample['bbox'] # [x1, y1, x2, y2]
instruction = sample['instruction']
```
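
Because each sample carries several text fields for the same box, one sample can yield multiple (text, box) training pairs. The sketch below shows one possible way to expand a sample; it assumes some text fields may be missing or empty for a given source, which the card does not confirm, and `text_bbox_pairs` is a hypothetical helper. It runs on a plain dict so it can be tried without downloading the dataset.

```python
from typing import Any, Dict, List, Tuple

# Text fields listed in the schema table that describe the target element.
TEXT_FIELDS = ("instruction", "description", "function", "combine")


def text_bbox_pairs(sample: Dict[str, Any]) -> List[Tuple[str, str, List[int]]]:
    """Return (field_name, text, bbox) for every non-empty text field."""
    pairs = []
    for field in TEXT_FIELDS:
        text = sample.get(field)
        # Skip fields that are absent, non-string, or whitespace only.
        if isinstance(text, str) and text.strip():
            pairs.append((field, text.strip(), list(sample["bbox"])))
    return pairs


# Stand-in sample built from example values in the schema table.
demo_sample = {
    "dataset": "ariaui",
    "bbox": [33, 75, 534, 132],
    "instruction": "Tap the Search Maps field",
    "description": "Dark gray, rounded search bar with magnifying glass icon",
    "function": "",
}

for field, text, bbox in text_bbox_pairs(demo_sample):
    print(field, "->", text, bbox)
```

The same function should work on a sample returned by `dataset['train'][0]`, since the loaded samples are indexed by field name like a dict.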
## Licensing
This dataset inherits licenses from its constituent sources:
| Source Dataset | License |
|---------------|---------|
| Aria-UI | Apache License 2.0 |
| OmniAct | MIT License |
| Widget Caption | Creative Commons Attribution 4.0 |
| UI-Vision | MIT License |
| OS-Atlas | Apache License 2.0 |
**Important**: Each component dataset retains its original license. Please refer to the original repositories for complete licensing terms and conditions.
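
When only some licenses are acceptable for a project, samples can be filtered by their `dataset` field. The sketch below assumes the identifiers shown in the schema table (including the `os_altas` spelling given there) map to the rows of the licensing table in the same order; `LICENSE_BY_SOURCE`, `license_for`, and `filter_by_license` are hypothetical names, and the original repositories remain the authority on licensing terms.

```python
from typing import Any, Dict, Iterable, List, Set

# Assumed mapping from `dataset` identifiers to the licenses listed above.
LICENSE_BY_SOURCE = {
    "ariaui": "Apache License 2.0",
    "omniact": "MIT License",
    "widget_caption": "Creative Commons Attribution 4.0",
    "ui_vision": "MIT License",
    "os_altas": "Apache License 2.0",
}


def license_for(sample: Dict[str, Any]) -> str:
    """Look up the license for a sample, or 'unknown' if unmapped."""
    return LICENSE_BY_SOURCE.get(sample.get("dataset"), "unknown")


def filter_by_license(
    samples: Iterable[Dict[str, Any]], allowed: Set[str]
) -> List[Dict[str, Any]]:
    """Keep only samples whose source license is in `allowed`."""
    return [s for s in samples if license_for(s) in allowed]


# Made-up records carrying only the field needed for the lookup.
records = [{"dataset": "omniact"}, {"dataset": "widget_caption"}, {"dataset": "other"}]
mit_only = filter_by_license(records, {"MIT License"})
print(mit_only)
```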
## Citation
If you use this dataset in your research, please cite our work:
```bibtex
@article{yang2025gta1guitesttimescaling,
title={GTA1: GUI Test-time Scaling Agent},
author={Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and Junzhe Huang and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Silvio Savarese and Caiming Xiong and Junnan Li},
year={2025},
eprint={2507.05791},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2507.05791},
}
```
Please also cite the original datasets:
```bibtex
@misc{yang2025ariauivisualgroundinggui,
title={Aria-UI: Visual Grounding for GUI Instructions},
author={Yuhao Yang and Yue Wang and Dongxu Li and Ziyang Luo and Bei Chen and Chao Huang and Junnan Li},
year={2025},
eprint={2412.16256},
archivePrefix={arXiv},
primaryClass={cs.HC},
url={https://arxiv.org/abs/2412.16256},
}
@misc{kapoor2024omniactdatasetbenchmarkenabling,
title={OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web},
author={Raghav Kapoor and Yash Parag Butala and Melisa Russak and Jing Yu Koh and Kiran Kamble and Waseem Alshikh and Ruslan Salakhutdinov},
year={2024},
eprint={2402.17553},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2402.17553},
}
@misc{li2020widgetcaptioninggeneratingnatural,
title={Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements},
author={Yang Li and Gang Li and Luheng He and Jingjie Zheng and Hong Li and Zhiwei Guan},
year={2020},
eprint={2010.04295},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2010.04295},
}
@misc{nayak2025uivisiondesktopcentricguibenchmark,
title={UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction},
author={Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Juan A. Rodriguez and Montek Kalsi and Rabiul Awal and Nicolas Chapados and M. Tamer Özsu and Aishwarya Agrawal and David Vazquez and Christopher Pal and Perouz Taslakian and Spandana Gella and Sai Rajeswar},
year={2025},
eprint={2503.15661},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.15661},
}
@misc{wu2024osatlasfoundationactionmodel,
title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
author={Zhiyong Wu and Zhenyu Wu and Fangzhi Xu and Yian Wang and Qiushi Sun and Chengyou Jia and Kanzhi Cheng and Zichen Ding and Liheng Chen and Paul Pu Liang and Yu Qiao},
year={2024},
eprint={2410.23218},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.23218},
}
```
Provider: maas
Created: 2025-10-04



