
JasonMun7/GroundCUA

Source: Hugging Face. Updated 2026-03-05; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/JasonMun7/GroundCUA
---
language:
- en
tags:
- computer_use
- agents
- grounding
- multimodal
- ui-vision
- GroundCUA
size_categories:
- "1M<n<10M"
license: mit
task_categories:
- image-to-text
---

<!-- <p align="center">
  <img src="assets/groundcua-hq.png" width="100%" alt="GroundCUA Overview">
</p> -->

<h1 align="center" style="font-size:42px; font-weight:700;">
  GroundCUA: Grounding Computer Use Agents on Human Demonstrations
</h1>

<p align="center">
  🌐 <a href="https://groundcua.github.io">Website</a> |
  📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a> |
  🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a> |
  🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Models</a>
</p>

<p align="center">
  <img src="assets/groundcua-hq.png" width="100%" alt="GroundCUA Overview">
</p>

# GroundCUA Dataset

GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers **87 software platforms** across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI grounding, UI perception, and vision-language-action models that interact with computers.

---

## Highlights

- **87 platforms** spanning Windows, macOS, Linux, and cross-platform apps
- **Annotated UI elements** with bounding boxes, text, and coarse semantic categories
- **SHA-256 file pairing** between screenshots and JSON annotations
- **Supports research on GUI grounding, multimodal agents, and UI understanding**
- **MIT license** for broad academic and open source use

---

## Dataset Structure

```
GroundCUA/
├── data/      # JSON annotation files
├── images/    # Screenshot images
└── README.md
```

### Directory Layout

Each platform appears as a directory name inside both `data/` and `images/`.

- `data/PlatformName/` contains annotation JSON files
- `images/PlatformName/` contains corresponding PNG screenshots

Image and annotation files share the same SHA-256 hash.

---

## File Naming Convention

Each screenshot has a matching annotation file using the same hash:

- `data/PlatformName/[hash].json`
- `images/PlatformName/[hash].png`

This structure ensures:

- Unique identifiers for each screenshot
- Easy pairing between images and annotations
- Compatibility with pipelines that expect hash-based addressing

---

## Annotation Format

Each annotation file is a list of UI element entries describing visible elements in the screenshot.

```json
[
  {
    "image_path": "PlatformName/screenshot_hash.png",
    "bbox": [x1, y1, x2, y2],
    "text": "UI element text",
    "category": "Element category",
    "id": "unique-id"
  }
]
```

### Field Descriptions

**image_path**
Relative path to the screenshot.

**bbox**
Bounding box coordinates `[x1, y1, x2, y2]` in pixel space.

**text**
Visible text or a short description of the element.

**category**
Coarse UI type label. Present only for some elements.

**id**
Unique identifier for the annotation entry.

---

## UI Element Categories

Categories are approximate and not guaranteed for all elements. Examples include:

- **Button**
- **Menu**
- **Input Elements**
- **Navigation**
- **Sidebar**
- **Visual Elements**
- **Information Display**
- **Others**

These labels provide light structure for UI grounding tasks but do not form a full ontology.

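
The directory layout, hash-based naming, and annotation fields described above can be tied together in a short loader. The following is a minimal sketch, not an official API: the function names `iter_pairs` and `load_elements` are hypothetical, it assumes the dataset has already been downloaded to a local directory with the `data/` and `images/` layout shown earlier, and the `center` field is an added convenience (a common click target for grounding), not something stored in the annotation files.

```python
import json
from pathlib import Path


def iter_pairs(root):
    """Yield (image_path, annotation_path) for each annotation with a matching PNG.

    Hypothetical helper: pairs files by platform directory and shared hash stem.
    """
    root = Path(root)
    for ann_path in sorted((root / "data").glob("*/*.json")):
        platform, file_hash = ann_path.parent.name, ann_path.stem
        img_path = root / "images" / platform / f"{file_hash}.png"
        if img_path.exists():  # skip annotations whose screenshot is missing
            yield img_path, ann_path


def load_elements(ann_path):
    """Load one annotation file and return a list of normalised element dicts."""
    with open(ann_path, encoding="utf-8") as f:
        entries = json.load(f)

    elements = []
    for entry in entries:
        x1, y1, x2, y2 = entry["bbox"]  # pixel coordinates
        elements.append({
            "id": entry["id"],
            "text": entry.get("text", ""),
            # `category` is present only for some elements, so default to None.
            "category": entry.get("category"),
            "bbox": (x1, y1, x2, y2),
            # Assumption: the box centre is used as the grounding target.
            "center": ((x1 + x2) / 2, (y1 + y2) / 2),
        })
    return elements
```

A typical loop would call `iter_pairs("GroundCUA")` and pass each annotation path to `load_elements`. Reading `category` with `.get` keeps the loader from failing on the elements that carry no category label.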

---

## Example Use Cases

GroundCUA can be used for:

- Training computer use agents to perceive and understand UI layouts
- Building GUI grounding modules for VLA agents
- Pretraining screen parsing and UI element detectors
- Benchmarking OCR, layout analysis, and cross-platform UI parsing
- Developing models that map UI regions to natural language or actions

---

## Citation

If you use GroundCUA in your research, please cite our work:

```bibtex
@misc{feizi2025groundingcomputeruseagents,
  title={Grounding Computer Use Agents on Human Demonstrations},
  author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
  year={2025},
  eprint={2511.07332},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.07332},
}
```

## License

GroundCUA is released under the MIT License. Users are responsible for ensuring compliance with all applicable laws and policies.

Provided by:
JasonMun7