JasonMun7/GroundCUA
Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/JasonMun7/GroundCUA
Resource description:
---
language:
- en
tags:
- computer_use
- agents
- grounding
- multimodal
- ui-vision
- GroundCUA
size_categories:
- "1M<n<10M"
license: mit
task_categories:
- image-to-text
---
<!-- <p align="center">
<img src="assets/groundcua-hq.png" width="100%" alt="GroundCUA Overview">
</p> -->
<h1 align="center" style="font-size:42px; font-weight:700;">
GroundCUA: Grounding Computer Use Agents on Human Demonstrations
</h1>
<p align="center">
🌐 <a href="https://groundcua.github.io">Website</a> |
📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a> |
🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a> |
🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Models</a>
</p>
<p align="center">
<img src="assets/groundcua-hq.png" width="100%" alt="GroundCUA Overview">
</p>
# GroundCUA Dataset
GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers **87 software platforms** across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI grounding, UI perception, and vision-language-action models that interact with computers.
---
## Highlights
- **87 platforms** spanning Windows, macOS, Linux, and cross-platform apps
- **Annotated UI elements** with bounding boxes, text, and coarse semantic categories
- **SHA-256 file pairing** between screenshots and JSON annotations
- **Supports research on GUI grounding, multimodal agents, and UI understanding**
- **MIT license** for broad academic and open source use
---
## Dataset Structure
```
GroundCUA/
├── data/ # JSON annotation files
├── images/ # Screenshot images
└── README.md
```
### Directory Layout
Each platform appears as a directory name inside both `data/` and `images/`.
- `data/PlatformName/` contains annotation JSON files
- `images/PlatformName/` contains corresponding PNG screenshots

Each screenshot and its annotation file use the same SHA-256 hash as their file name, so the two can be matched by name within a platform directory.
---
## File Naming Convention
Each screenshot has a matching annotation file using the same hash:
- `data/PlatformName/[hash].json`
- `images/PlatformName/[hash].png`
This structure ensures:
- Unique identifiers for each screenshot
- Easy pairing between images and annotations
- Compatibility with pipelines that expect hash-based addressing
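
The pairing described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the dataset's tooling: the function name `pair_files` and the `root` argument are hypothetical, and the code assumes only the `data/PlatformName/[hash].json` and `images/PlatformName/[hash].png` layout stated in this README.

```python
from pathlib import Path


def pair_files(root: Path) -> dict[str, tuple[Path, Path]]:
    """Map 'PlatformName/hash' to (image_path, annotation_path).

    Assumes the layout described in this README:
    data/PlatformName/[hash].json and images/PlatformName/[hash].png.
    """
    pairs = {}
    for ann_path in sorted((root / "data").glob("*/*.json")):
        platform = ann_path.parent.name
        img_path = root / "images" / platform / f"{ann_path.stem}.png"
        if img_path.is_file():  # skip annotations with no matching screenshot
            pairs[f"{platform}/{ann_path.stem}"] = (img_path, ann_path)
    return pairs
```

Calling `pair_files(Path("GroundCUA"))` on a local copy would return one entry per screenshot that has both files present. Keying on the platform name plus the hash avoids assuming that hashes are unique across platforms.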
---
## Annotation Format
Each annotation file is a list of UI element entries describing visible elements in the screenshot.
```json
[
{
"image_path": "PlatformName/screenshot_hash.png",
"bbox": [x1, y1, x2, y2],
"text": "UI element text",
"category": "Element category",
"id": "unique-id"
}
]
```
### Field Descriptions
- **image_path**: Relative path to the screenshot.
- **bbox**: Bounding box coordinates `[x1, y1, x2, y2]` in pixel space.
- **text**: Visible text or a short description of the element.
- **category**: Coarse UI type label. Present only for some elements.
- **id**: Unique identifier for the annotation entry.
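
As a rough sketch of how these fields might be read, the snippet below parses the JSON text of one annotation file into typed records. The `UIElement` class and `parse_annotations` function are hypothetical names introduced here for illustration. The code assumes the five fields listed above and treats `category` as optional, since it is present only for some elements.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class UIElement:
    image_path: str
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    text: str
    id: str
    category: Optional[str] = None  # present only for some elements


def parse_annotations(raw: str) -> list[UIElement]:
    """Parse the JSON text of one annotation file into UIElement records."""
    elements = []
    for entry in json.loads(raw):
        x1, y1, x2, y2 = entry["bbox"]
        elements.append(
            UIElement(
                image_path=entry["image_path"],
                bbox=(x1, y1, x2, y2),
                text=entry.get("text", ""),
                id=str(entry["id"]),
                category=entry.get("category"),  # None when the label is absent
            )
        )
    return elements
```

A typical call would be `parse_annotations(Path(annotation_file).read_text(encoding="utf-8"))`, where `annotation_file` is any `[hash].json` path under `data/`.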
---
## UI Element Categories
Categories are approximate and not guaranteed for all elements. Examples include:
- **Button**
- **Menu**
- **Input Elements**
- **Navigation**
- **Sidebar**
- **Visual Elements**
- **Information Display**
- **Others**
These labels provide light structure for UI grounding tasks but do not form a full ontology.
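
Because category labels are not guaranteed, code that aggregates over them needs to handle missing values. The sketch below counts elements per category for a list of annotation entries. The function name `category_counts` and the `"Unlabeled"` bucket are assumptions made for this example, not names used by the dataset.

```python
from collections import Counter


def category_counts(entries: list[dict]) -> Counter:
    """Count elements per category.

    Entries with no category (missing, null, or empty) are grouped under
    the hypothetical bucket "Unlabeled".
    """
    return Counter(entry.get("category") or "Unlabeled" for entry in entries)
```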
---
## Example Use Cases
GroundCUA can be used for:
- Training computer use agents to perceive and understand UI layouts
- Building GUI grounding modules for VLA agents
- Pretraining screen parsing and UI element detectors
- Benchmarking OCR, layout analysis, and cross-platform UI parsing
- Developing models that map UI regions to natural language or actions
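
To make the last use case concrete, one possible way to turn an annotation entry into a grounding sample is to pair an instruction with the center of the element's bounding box. This is only a sketch under stated assumptions: the instruction template, the function names, and the choice of the box center as the target point are illustrative and are not taken from the GroundCUA paper or its training pipeline.

```python
def bbox_center(bbox: list[float]) -> tuple[float, float]:
    """Return the center point of an [x1, y1, x2, y2] box in pixel space."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def to_grounding_sample(entry: dict) -> dict:
    """Build an (instruction, target point) pair from one annotation entry.

    The "Click on: ..." template is a hypothetical placeholder.
    """
    return {
        "image_path": entry["image_path"],
        "instruction": f"Click on: {entry['text']}",
        "point": bbox_center(entry["bbox"]),
    }
```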
---
## Citation
If you use GroundCUA in your research, please cite our work:
```bibtex
@misc{feizi2025groundingcomputeruseagents,
title={Grounding Computer Use Agents on Human Demonstrations},
author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
year={2025},
eprint={2511.07332},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.07332},
}
```
## License
GroundCUA is released under the MIT License.
Users are responsible for ensuring compliance with all applicable laws and policies.
Provider:
JasonMun7



