OS-Atlas-data
Source: ModelScope community. Updated 2026-01-07; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/OS-Copilot/OS-Atlas-data
# GUI Grounding Pre-training Data for OS-ATLAS
This document describes how to obtain the pre-training data used by OS-ATLAS, introduced in [OS-ATLAS: A Foundation Action Model for Generalist GUI Agents](https://huggingface.co/papers/2410.23218).
<div align="center">
[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)
</div>

**Notes:** In GUI grounding data, the position of the target element is recorded in the `bbox` key, represented by `[left, top, right, bottom]`.
Each value is a decimal number in the range [0, 1], giving the ratio of the corresponding position to the width (for `left` and `right`) or height (for `top` and `bottom`) of the image.
The data stored in this dataset consists of raw data containing **only** element grounding information. When training a model, you need to use the corresponding prompts to wrap these data.
The data we released is divided into three domains: mobile, desktop and web.
All annotation data is stored in JSON format and each sample contains:
* `img_filename`: the interface screenshot file
* `instruction`: human instruction or referring expression extracted from the a11y (accessibility) tree or HTML
* `bbox`: the bounding box of the target element corresponding to instruction
Some data also contains a `data_type`, which records the type of an element in its structured information, if it can be obtained.
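As a minimal sketch of how the normalized `bbox` values relate to pixels, the snippet below converts one box to pixel coordinates. The sample record and the image size are made up for illustration, and the helper name `bbox_to_pixels` is hypothetical, not part of the released code.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized [left, top, right, bottom] box to pixel coordinates."""
    left, top, right, bottom = bbox
    # left/right are ratios of the image width; top/bottom are ratios of the height.
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Hypothetical sample shaped like the fields listed above (not taken from the dataset).
sample = {
    "img_filename": "example.png",
    "instruction": "Home",
    "bbox": [0.25, 0.5, 0.75, 1.0],
}

# Assumed screenshot size of 1280x720; in practice read it from the image file.
print(bbox_to_pixels(sample["bbox"], width=1280, height=720))
```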
***
### Mobile data
This part of the data is stored under the *mobile_domain* directory. Our mobile grounding data consists of four parts.
#### AMEX
Android Multi-annotation EXpo (AMEX) is a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents [1].
The annotation data is stored in
- `amex_raw.json`
Due to the single-file size limit of Hugging Face datasets, we stored the AMEX images in *zip* format and split the archive into several sub-files.
- `amex_images_part_aa`
- `amex_images_part_ab`
- `amex_images_part_ac`
You need to first merge these split files back into the original file and then extract the contents.
```
cat amex_images_part_* > amex_images.zip
7z x amex_images.zip -aoa -o/path/to/extract/folder
```
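If `cat` and `7z` are not available, the same merge-then-extract step can be sketched with the Python standard library. This is an illustrative alternative, not the authors' tooling: the function names are hypothetical, and it assumes the merged archive is a regular zip file that Python's `zipfile` module can read.

```python
import shutil
import zipfile
from pathlib import Path


def merge_parts(directory: Path, prefix: str, merged_name: str) -> Path:
    """Concatenate split files in name order, like `cat prefix* > merged_name`."""
    parts = sorted(p for p in directory.glob(prefix + "*") if p.name != merged_name)
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {directory}")
    merged = directory / merged_name
    with merged.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream each part so large files are never fully held in memory.
                shutil.copyfileobj(src, out)
    return merged


def extract_zip(archive: Path, target: Path) -> None:
    """Extract the archive into target, overwriting files that already exist."""
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)


# Example usage (paths are placeholders):
# merged = merge_parts(Path("mobile_domain"), "amex_images_part_", "amex_images.zip")
# extract_zip(merged, Path("/path/to/extract/folder"))
```

The same two helpers apply to the other split archives in this dataset by changing the prefix and the merged file name.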
#### UIBert
UIBert [2] is a dataset extended from Rico dataset [3] for two tasks: similar UI component retrieval and referring expression component retrieval.
The annotation data is stored in
- `uibert_raw.json`
The UIBert images are stored in
- `UIBert.zip`
#### Widget Captioning and RICOSCA
Widget Captioning data are collected by [4].
RICOSCA is a dataset automatically labeled using the Android view hierarchy (VH) in [5].
The annotation data is stored in
- `widget_captioning.json`
- `ricosca.json`
The Rico images are stored in
- `rico_imgs.zip`
#### Android_world_data
This part of the data is sampled from an Android environment for building and benchmarking autonomous computer control agents [6].
The annotation data is stored in
- `aw_mobile.json`
The images are stored in
- `mobile_images.zip`
***
### Desktop data
This part of the data is stored under the *desktop_domain* directory.
All of the desktop grounding data is collected from the real environments of personal computers running different operating systems. Each image is split into multiple sub-images to enhance data diversity.
Our desktop grounding data consists of three parts: Windows, Linux and MacOS.
**The image and annotation data for each operating system are stored in corresponding zip and json files.**
It is worth noting that, due to the large size of the Windows image data, the split files need to be merged before extraction.
```
cat windows_image_part_* > windows_images.zip
7z x windows_images.zip -aoa -o/path/to/extract/folder
```
***
### Web data
This part of the data is stored under the *web_domain* directory.
Our web grounding data consists of two parts.
#### SeeClick web data
The web data from SeeClick [7] was crawled from websites provided by Common Crawl, containing more than 270k webpage screenshots and over 3 million webpage elements.
The annotation data is stored in
- `seeclick_web.json`
The images are stored as split files, which need to be merged before extraction.
```
cat seeclick_web_image_part_* > seeclick_web_images.zip
7z x seeclick_web_images.zip -aoa -o/path/to/extract/folder
```
#### Fineweb_crawled_data
This part of the data is crawled from web pages at the latest URLs obtained from FineWeb [8], a cleaned and deduplicated English dataset derived from Common Crawl.
Since this portion of the data contains at least 1.6 million images, we have compressed them into 10 zip files, from `fineweb_3m_s11.zip` to `fineweb_3m_s52.zip`.
Please extract them into the same directory.
As an example,
```
7z x fineweb_3m_s11.zip -aoa -o/same/path/to/extract/fineweb
```
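Because the archives must all land in the same directory, a small loop can save repeating the command by hand. The sketch below is an assumption-laden alternative to `7z`: it uses Python's `zipfile` module, matches the archives with the glob pattern `fineweb_3m_s*.zip` (the exact intermediate file names are not listed here), and the function name `extract_all` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_all(archive_dir: Path, target: Path, pattern: str = "fineweb_3m_s*.zip") -> int:
    """Extract every archive matching pattern into one shared target directory."""
    target.mkdir(parents=True, exist_ok=True)
    archives = sorted(archive_dir.glob(pattern))
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            # Existing files are overwritten, similar in effect to `7z x ... -aoa`.
            zf.extractall(target)
    return len(archives)


# Example usage (paths are placeholders):
# count = extract_all(Path("web_domain"), Path("/same/path/to/extract/fineweb"))
# print(f"extracted {count} archives")
```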
The annotation data is stored in
- `fineweb_3m.json`
***
### Best practice
During the training of **OS-Atlas-4B**, we randomly sampled predefined prompts to wrap the grounding data. Additionally, we scaled the relative coordinates of each element (in the range [0, 1]) by multiplying them by 1000 before inputting them into the model for training.
Below is an example of a data entry for **OS-Atlas-4B**:
```
{
"conversations": [
{
"from": "human",
"value": "<image>\nUsing the provided screenshot, I'll describe webpage elements for you to locate (with bbox).\n<ref>media-maniacs.org</ref>\n<ref>Home</ref>\n<ref>Sitemap</ref>\n<ref>shop you can have what you choose 2012</ref>"
},
{
"from": "gpt",
"value": "<ref>media-maniacs.org</ref><box>[[70,856,141,871]]</box>\n<ref>Home</ref><box>[[21,935,43,951]]</box>\n<ref>Sitemap</ref><box>[[21,919,52,934]]</box>\n<ref>shop you can have what you choose 2012</ref><box>[[368,839,523,855]]</box>"
}
]
}
```
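The answer string in this format can be produced from raw samples roughly as follows. This is only a sketch: rounding to the nearest integer after multiplying by 1000 is an assumption (the exact rounding is not stated), the function names are hypothetical, and the normalized input box is invented to match one line of the example.

```python
from typing import List, Sequence, Tuple


def scale_bbox(bbox: Sequence[float], factor: int = 1000) -> List[int]:
    """Scale normalized [0, 1] coordinates by factor (rounding is an assumption)."""
    return [round(v * factor) for v in bbox]


def format_ref_box(pairs: Sequence[Tuple[str, Sequence[float]]]) -> str:
    """Render (instruction, bbox) pairs as <ref>...</ref><box>[[...]]</box> lines."""
    lines = []
    for text, bbox in pairs:
        x1, y1, x2, y2 = scale_bbox(bbox)
        lines.append(f"<ref>{text}</ref><box>[[{x1},{y1},{x2},{y2}]]</box>")
    return "\n".join(lines)


# Hypothetical normalized box chosen to reproduce the "Home" line of the example.
print(format_ref_box([("Home", [0.021, 0.935, 0.043, 0.951])]))
```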
The corresponding example for **OS-Atlas-7B**:
```
{
"conversations": [
{
"from": "human",
"value": "<image>\nUsing the provided screenshot, I'll describe webpage elements for you to locate (with bbox).\n<|object_ref_start|>Facebook<|object_ref_end|>\n<|object_ref_start|>Subscribe<|object_ref_end|>\n<|object_ref_start|>Twitter<|object_ref_end|>\n<|object_ref_start|>Read More<|object_ref_end|>\n<|object_ref_start|>Read More<|object_ref_end|>"
},
{
"from": "gpt",
"value": "<|object_ref_start|>Facebook<|object_ref_end|><|box_start|>(4,955),(36,970)<|box_end|>\n<|object_ref_start|>Subscribe<|object_ref_end|><|box_start|>(4,913),(43,932)<|box_end|>\n<|object_ref_start|>Twitter<|object_ref_end|><|box_start|>(39,955),(62,970)<|box_end|>\n<|object_ref_start|>Read More<|object_ref_end|><|box_start|>(30,138),(73,157)<|box_end|>\n<|object_ref_start|>Read More<|object_ref_end|><|box_start|>(30,139),(73,155)<|box_end|>"
}
]
}
```
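A full entry in this second format could be assembled along the following lines. Treat it as a sketch under stated assumptions: the prompt is the single one visible in the example rather than a random draw from `prompts.json`, coordinates are rounded to the nearest integer after scaling by 1000, and `build_entry` is a hypothetical helper name.

```python
import json
from typing import Dict, List, Sequence, Tuple

# Prompt copied from the example entry; real training samples a prompt at random.
PROMPT = "Using the provided screenshot, I'll describe webpage elements for you to locate (with bbox)."


def build_entry(pairs: Sequence[Tuple[str, Sequence[float]]], prompt: str = PROMPT) -> Dict[str, List[dict]]:
    """Wrap (instruction, normalized bbox) pairs into a two-turn conversation."""
    refs = [f"<|object_ref_start|>{text}<|object_ref_end|>" for text, _ in pairs]
    answers = []
    for ref, (_, bbox) in zip(refs, pairs):
        # Scale [0, 1] ratios to the 0..1000 range (rounding is an assumption).
        x1, y1, x2, y2 = (round(v * 1000) for v in bbox)
        answers.append(f"{ref}<|box_start|>({x1},{y1}),({x2},{y2})<|box_end|>")
    return {
        "conversations": [
            {"from": "human", "value": "<image>\n" + prompt + "\n" + "\n".join(refs)},
            {"from": "gpt", "value": "\n".join(answers)},
        ]
    }


# Hypothetical normalized box chosen to reproduce the "Facebook" line of the example.
entry = build_entry([("Facebook", [0.004, 0.955, 0.036, 0.970])])
print(json.dumps(entry, indent=2))
```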
The prompts we used are stored in `prompts.json`.
***
**The following are the open-source datasets we used as data sources. We welcome everyone to check the details and cite these sources accordingly!**
[1] [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents](https://arxiv.org/abs/2407.17490)
[2] [UIBert: Learning Generic Multimodal Representations for UI Understanding](https://arxiv.org/abs/2107.13731)
[3] [Rico: A mobile app dataset for building data-driven design applications](https://dl.acm.org/doi/pdf/10.1145/3126594.3126651)
[4] [Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements](https://arxiv.org/pdf/2010.04295.pdf)
[5] [Mapping Natural Language Instructions to Mobile UI Action Sequences](https://arxiv.org/pdf/2005.03776)
[6] [ANDROIDWORLD: A Dynamic Benchmarking Environment for Autonomous Agents](https://arxiv.org/abs/2405.14573)
[7] [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
[8] [The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale](https://arxiv.org/abs/2406.17557)
Provider: maas
Created: 2025-01-08



