cua-lite/OS-Atlas

Hugging Face | Updated 2026-04-20 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/cua-lite/OS-Atlas
Resource description:
---
license: other
tags:
- cua-lite
- gui
- sft
task_categories:
- image-text-to-text
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "*/*/[t]rain.parquet"
    - "*/*/train/*.parquet"
    - "*/*/train/*/*.parquet"
  - split: validation
    path:
    - "*/*/[v]alidation.parquet"
    - "*/*/validation/*.parquet"
    - "*/*/validation/*/*.parquet"
- config_name: desktop
  data_files:
  - split: train
    path:
    - "desktop/*/[t]rain.parquet"
    - "desktop/*/train/*.parquet"
    - "desktop/*/train/*/*.parquet"
  - split: validation
    path:
    - "desktop/*/[v]alidation.parquet"
    - "desktop/*/validation/*.parquet"
    - "desktop/*/validation/*/*.parquet"
- config_name: mobile
  data_files:
  - split: train
    path:
    - "mobile/*/[t]rain.parquet"
    - "mobile/*/train/*.parquet"
    - "mobile/*/train/*/*.parquet"
  - split: validation
    path:
    - "mobile/*/[v]alidation.parquet"
    - "mobile/*/validation/*.parquet"
    - "mobile/*/validation/*/*.parquet"
- config_name: web
  data_files:
  - split: train
    path:
    - "web/*/[t]rain.parquet"
    - "web/*/train/*.parquet"
    - "web/*/train/*/*.parquet"
  - split: validation
    path:
    - "web/*/[v]alidation.parquet"
    - "web/*/validation/*.parquet"
    - "web/*/validation/*/*.parquet"
- config_name: desktop-grounding-bbox
  data_files:
  - split: train
    path:
    - "desktop/grounding-bbox/[t]rain.parquet"
    - "desktop/grounding-bbox/train/*.parquet"
    - "desktop/grounding-bbox/train/*/*.parquet"
  - split: validation
    path:
    - "desktop/grounding-bbox/[v]alidation.parquet"
    - "desktop/grounding-bbox/validation/*.parquet"
    - "desktop/grounding-bbox/validation/*/*.parquet"
- config_name: mobile-grounding-bbox
  data_files:
  - split: train
    path:
    - "mobile/grounding-bbox/[t]rain.parquet"
    - "mobile/grounding-bbox/train/*.parquet"
    - "mobile/grounding-bbox/train/*/*.parquet"
  - split: validation
    path:
    - "mobile/grounding-bbox/[v]alidation.parquet"
    - "mobile/grounding-bbox/validation/*.parquet"
    - "mobile/grounding-bbox/validation/*/*.parquet"
- config_name: web-grounding-bbox
  data_files:
  - split: train
    path:
    - "web/grounding-bbox/[t]rain.parquet"
    - "web/grounding-bbox/train/*.parquet"
    - "web/grounding-bbox/train/*/*.parquet"
  - split: validation
    path:
    - "web/grounding-bbox/[v]alidation.parquet"
    - "web/grounding-bbox/validation/*.parquet"
    - "web/grounding-bbox/validation/*/*.parquet"
---

# cua-lite/OS-Atlas

cua-lite preprocessed version of OS-Atlas (OS-Copilot/OS-Atlas-data). grounding:bbox across three platforms and five sub-sources: desktop (windows, linux, macos), mobile (amex), web (fineweb).

## Origin

- [https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data)

## Load via `datasets`

```python
from datasets import load_dataset

# entire dataset
ds = load_dataset("cua-lite/OS-Atlas")

# just one platform
ds = load_dataset("cua-lite/OS-Atlas", "desktop")

# just one (platform, task_type) cohort
ds = load_dataset("cua-lite/OS-Atlas", "desktop-grounding-bbox")
```

You can also filter by `metadata.platform` / `metadata.task_type` / `metadata.others.*` after loading; every row carries a rich `metadata` struct (see schema below).

## Schema

Each row has these columns:

| column | type | notes |
|---|---|---|
| `image_ids` | list[string] | content-addressed ids (`<sha256>.<ext>`), enables cross-parquet / cross-dataset dedup |
| `images` | list[Image] | bytes embedded at HF push time; matches `image_ids` index-for-index |
| `messages` | list[struct] | OpenAI-style turns with `role` + structured `content` |
| `metadata` | struct | `{platform, task_type, split, others{...}}` |

Coordinate values in `messages` are normalized to `[0, 1000]` integers.

## Layout

```
<platform>/<task_type>/<split>.parquet                                   # single-variant cohort
<platform>/<task_type>/<split>/<variant>.parquet                         # multi-variant cohort
<platform>/<task_type>/<split>/shard-NNNNN-of-NNNNN.parquet              # + sharded single-variant
<platform>/<task_type>/<split>/<variant>/shard-NNNNN-of-NNNNN.parquet    # + sharded multi-variant
```

- `platform` ∈ {desktop, mobile, web}
- `task_type` directory uses a hyphen where the metadata value uses a colon: `grounding-action/` → `grounding:action`
- `split` ∈ {train, validation}: `validation` is an in-distribution held-out slice (never used in training); `test` is reserved for out-of-distribution benchmark datasets

## Stats

| platform | task_type | variant | train | validation |
|---|---|---|---:|---:|
| desktop | grounding:bbox | linux | 42,327 | 817 |
| desktop | grounding:bbox | macos | 17,958 | 440 |
| desktop | grounding:bbox | windows | 1,073,175 | 2,000 |
| mobile | grounding:bbox | amex | 1,200,434 | 2,000 |
| mobile | grounding:bbox | aw | 88,078 | 1,753 |
| mobile | grounding:bbox | ricosca | 169,858 | 3,417 |
| mobile | grounding:bbox | uibert | 16,353 | 307 |
| mobile | grounding:bbox | widget | 99,425 | 2,000 |
| web | grounding:bbox | fineweb | 6,639,491 | 2,000 |
| web | grounding:bbox | seeclick | 2,112,523 | 2,000 |

## Image storage

Images are content-addressed by SHA-256 and deduplicated within this repo. The `images` column on HuggingFace embeds raw bytes so the Hub viewer renders thumbnails and `datasets.load_dataset` works out of the box.

For local workflows (SFT export, cross-dataset dedup, split rebalancing), run [`reverse.py`](https://github.com/cua-lite/cua-lite/tree/main/scripts/hf_upload) on a cloned repo: it extracts each unique `image_id` once to a shared `image_store/<hash[:2]>/<hash>.<ext>` and rewrites the parquets to drop the `images` column, so rows reference images by hash id only. The shared store is reusable across datasets: the same image in two repos lands in one file.
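
Two conventions described above, the directory layout and the image store path, can be sketched in a few lines. This is an illustration only and is not the logic of `reverse.py`: the helper names `parse_cohort_path`, `image_id_for`, and `store_path` are hypothetical, and it assumes that ids use the lowercase hex SHA-256 of the raw image bytes and that only the first hyphen of a task directory maps to a colon.

```python
import hashlib
from pathlib import PurePosixPath


def _strip_parquet(name):
    """Drop a trailing .parquet suffix, if present."""
    return name[: -len(".parquet")] if name.endswith(".parquet") else name


def parse_cohort_path(path):
    """Split a repo-relative parquet path into its layout fields."""
    parts = [_strip_parquet(p) for p in PurePosixPath(path).parts]
    platform, task_dir, split = parts[0], parts[1], parts[2]
    # Anything after the split that is not a shard file name is the variant.
    extra = [p for p in parts[3:] if not p.startswith("shard-")]
    return {
        "platform": platform,
        # Assumption: only the first hyphen maps to a colon.
        "task_type": task_dir.replace("-", ":", 1),
        "split": split,
        "variant": extra[0] if extra else None,
    }


def image_id_for(data, ext):
    """Assumed id scheme: lowercase hex SHA-256 of the raw bytes, plus extension."""
    return f"{hashlib.sha256(data).hexdigest()}.{ext}"


def store_path(image_id, root="image_store"):
    """Build image_store/<hash[:2]>/<hash>.<ext> for an image id."""
    digest = image_id.split(".", 1)[0]
    return PurePosixPath(root) / digest[:2] / image_id
```

For example, `parse_cohort_path("desktop/grounding-bbox/train/windows.parquet")` yields the `grounding:bbox` task type with variant `windows`. The two-character prefix directory in `store_path` keeps any single directory from holding every image in the store.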

- Total unique images: **1,312,118**
- Store size: **496.53 GB**

## Notes

Images are heavily reused: each screenshot is typically referenced by dozens of bbox labels. Content-addressed storage collapses this to a far smaller unique-image count than the ~3.58M row count suggests.

## License & citation

See the original dataset (OS-Copilot/OS-Atlas-data): https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data
Provider:
cua-lite