JasonMun7/GroundCUA

Name: JasonMun7/GroundCUA
Creator: JasonMun7
Published: 2026-03-05 19:44:56
License: No description provided

Source: Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/JasonMun7/GroundCUA

Resource description:

---
language:
- en
tags:
- computer_use
- agents
- grounding
- multimodal
- ui-vision
- GroundCUA
size_categories:
- "1M<n<10M"
license: mit
task_categories:
- image-to-text
---

<h1 align="center" style="font-size:42px; font-weight:700;">
  GroundCUA: Grounding Computer Use Agents on Human Demonstrations
</h1>

<p align="center">
  🌐 <a href="https://groundcua.github.io">Website</a> |
  📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a> |
  🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a> |
  🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Models</a>
</p>

<p align="center">
  <img src="assets/groundcua-hq.png" width="100%" alt="GroundCUA Overview">
</p>

# GroundCUA Dataset

GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers **87 software platforms** across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI grounding, UI perception, and vision-language-action models that interact with computers.

---

## Highlights

- **87 platforms** spanning Windows, macOS, Linux, and cross-platform apps
- **Annotated UI elements** with bounding boxes, text, and coarse semantic categories
- **SHA-256 file pairing** between screenshots and JSON annotations
- **Supports research on GUI grounding, multimodal agents, and UI understanding**
- **MIT license** for broad academic and open source use

---

## Dataset Structure

```
GroundCUA/
├── data/      # JSON annotation files
├── images/    # Screenshot images
└── README.md
```

### Directory Layout

Each platform appears as a directory name inside both `data/` and `images/`.

- `data/PlatformName/` contains annotation JSON files
- `images/PlatformName/` contains corresponding PNG screenshots

Image and annotation files share the same SHA-256 hash.
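
The layout above suggests a simple way to pair screenshots with annotations. The following is a minimal sketch, assuming the shared hash is used as the file stem in both trees and that the dataset has been downloaded to a local directory. The function name `pair_files` and the `root` argument are hypothetical and not part of the dataset.

```python
from pathlib import Path


def pair_files(root):
    """Return (platform, image_path, annotation_path) tuples.

    Assumes the layout root/data/<Platform>/<hash>.json and
    root/images/<Platform>/<hash>.png. Annotation files without a
    matching screenshot are skipped.
    """
    root = Path(root)
    pairs = []
    for ann in sorted((root / "data").glob("*/*.json")):
        platform = ann.parent.name
        img = root / "images" / platform / (ann.stem + ".png")
        if img.is_file():
            pairs.append((platform, img, ann))
    return pairs
```

Calling `pair_files("GroundCUA")` on a local copy would return one tuple per annotation file that has a matching screenshot, which can help spot incomplete downloads.
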

---

## File Naming Convention

Each screenshot has a matching annotation file using the same hash:

- `data/PlatformName/[hash].json`
- `images/PlatformName/[hash].png`

This structure ensures:

- Unique identifiers for each screenshot
- Easy pairing between images and annotations
- Compatibility with pipelines that expect hash-based addressing

---

## Annotation Format

Each annotation file is a list of UI element entries describing visible elements in the screenshot.

```json
[
  {
    "image_path": "PlatformName/screenshot_hash.png",
    "bbox": [x1, y1, x2, y2],
    "text": "UI element text",
    "category": "Element category",
    "id": "unique-id"
  }
]
```

### Field Descriptions

**image_path**
Relative path to the screenshot.

**bbox**
Bounding box coordinates `[x1, y1, x2, y2]` in pixel space.

**text**
Visible text or a short description of the element.

**category**
Coarse UI type label. Present only for some elements.

**id**
Unique identifier for the annotation entry.

---

## UI Element Categories

Categories are approximate and not guaranteed for all elements. Examples include:

- **Button**
- **Menu**
- **Input Elements**
- **Navigation**
- **Sidebar**
- **Visual Elements**
- **Information Display**
- **Others**

These labels provide light structure for UI grounding tasks but do not form a full ontology.
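
To make the annotation format concrete, here is a minimal sketch of reading one annotation file's contents and summarizing it. It assumes numeric `bbox` values and treats a missing or empty `category` as `"Unlabeled"`. The helper names (`load_elements`, `bbox_center`, `category_counts`) and the `"Unlabeled"` placeholder are hypothetical. Using the box center as a point target is an illustrative choice, not something the dataset card prescribes.

```python
import json
from collections import Counter


def load_elements(text):
    """Parse the JSON text of one annotation file into a list of dicts."""
    elements = json.loads(text)
    if not isinstance(elements, list):
        raise ValueError("expected a JSON list of UI element entries")
    return elements


def bbox_center(bbox):
    """Return the (x, y) center of an [x1, y1, x2, y2] pixel box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def category_counts(elements):
    """Count elements per category; entries without one count as 'Unlabeled'."""
    return Counter(e.get("category") or "Unlabeled" for e in elements)
```

For a file on disk, `load_elements(Path(p).read_text(encoding="utf-8"))` would feed these helpers. Because `category` is only present for some elements, code that reads it should use `.get()` rather than direct indexing.
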

---

## Example Use Cases

GroundCUA can be used for:

- Training computer use agents to perceive and understand UI layouts
- Building GUI grounding modules for VLA agents
- Pretraining screen parsing and UI element detectors
- Benchmarking OCR, layout analysis, and cross-platform UI parsing
- Developing models that map UI regions to natural language or actions

---

## Citation

If you use GroundCUA in your research, please cite our work:

```bibtex
@misc{feizi2025groundingcomputeruseagents,
  title={Grounding Computer Use Agents on Human Demonstrations},
  author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
  year={2025},
  eprint={2511.07332},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.07332},
}
```

## License

GroundCUA is released under the MIT License. Users are responsible for ensuring compliance with all applicable laws and policies.


Provider:

JasonMun7
