cua-lite/Jedi
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/cua-lite/Jedi
Resource description:
---
license: other
tags:
- cua-lite
- gui
- sft
task_categories:
- image-text-to-text
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "*/*/train*parquet"
    - "*/*/train/*.parquet"
    - "*/*/train/*/*.parquet"
  - split: validation
    path:
    - "*/*/validation*parquet"
    - "*/*/validation/*.parquet"
    - "*/*/validation/*/*.parquet"
- config_name: desktop-grounding-bbox
  data_files:
  - split: train
    path:
    - "desktop/grounding-bbox/train*parquet"
    - "desktop/grounding-bbox/train/*.parquet"
    - "desktop/grounding-bbox/train/*/*.parquet"
  - split: validation
    path:
    - "desktop/grounding-bbox/validation*parquet"
    - "desktop/grounding-bbox/validation/*.parquet"
    - "desktop/grounding-bbox/validation/*/*.parquet"
- config_name: desktop-grounding-point
  data_files:
  - split: train
    path:
    - "desktop/grounding-point/train*parquet"
    - "desktop/grounding-point/train/*.parquet"
    - "desktop/grounding-point/train/*/*.parquet"
  - split: validation
    path:
    - "desktop/grounding-point/validation*parquet"
    - "desktop/grounding-point/validation/*.parquet"
    - "desktop/grounding-point/validation/*/*.parquet"
- config_name: desktop-understanding
  data_files:
  - split: train
    path:
    - "desktop/understanding/train*parquet"
    - "desktop/understanding/train/*.parquet"
    - "desktop/understanding/train/*/*.parquet"
  - split: validation
    path:
    - "desktop/understanding/validation*parquet"
    - "desktop/understanding/validation/*.parquet"
    - "desktop/understanding/validation/*/*.parquet"
---
# cua-lite/Jedi
A cua-lite preprocessed version of Jedi (`xlangai/Jedi`). It contains desktop GUI data covering three task types: `understanding` (icon captioning, layout description), `grounding:bbox` (layout regions), and `grounding:point` (icon centers).
## Origin
- [https://huggingface.co/datasets/xlangai/Jedi](https://huggingface.co/datasets/xlangai/Jedi)
## Load via `datasets`
```python
from datasets import load_dataset
# entire dataset
ds = load_dataset("cua-lite/Jedi")
# just one (platform, task_type) cohort
ds = load_dataset("cua-lite/Jedi", "desktop-grounding-bbox")
```
You can also filter by `metadata.platform` / `metadata.task_type` /
`metadata.others.*` after loading; every row carries a rich `metadata`
struct (see schema below).
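As a minimal sketch of such post-load filtering, the helper below builds a row predicate over the `metadata` struct. The helper name `make_metadata_filter` and the `variant` key under `metadata.others` are assumptions made for illustration; the card does not list which keys `others` actually contains.

```python
from typing import Any, Callable, Mapping, Optional


def make_metadata_filter(
    platform: Optional[str] = None,
    task_type: Optional[str] = None,
    **others: Any,
) -> Callable[[Mapping[str, Any]], bool]:
    """Build a predicate that checks a row's `metadata` struct.

    Hypothetical helper: keyword arguments in `others` are compared
    against `metadata.others`, whose keys are not documented here.
    """

    def predicate(row: Mapping[str, Any]) -> bool:
        meta = row["metadata"]
        if platform is not None and meta["platform"] != platform:
            return False
        if task_type is not None and meta["task_type"] != task_type:
            return False
        extra = meta.get("others") or {}
        return all(extra.get(key) == value for key, value in others.items())

    return predicate


# Stand-in rows shaped like the documented metadata struct.
# The "variant" key is an assumption, not a confirmed field.
rows = [
    {"metadata": {"platform": "desktop", "task_type": "grounding:bbox",
                  "split": "train", "others": {"variant": "bbox"}}},
    {"metadata": {"platform": "desktop", "task_type": "understanding",
                  "split": "train", "others": {"variant": "layout"}}},
    {"metadata": {"platform": "desktop", "task_type": "understanding",
                  "split": "train", "others": {"variant": "icon_caption"}}},
]

keep = make_metadata_filter(task_type="understanding", variant="layout")
selected = [row for row in rows if keep(row)]
```

On a loaded split, the same predicate could be passed to `Dataset.filter`, for example `ds["train"].filter(keep)`.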
## Schema
Each row has these columns:
| column | type | notes |
|---|---|---|
| `image_ids` | list[string] | content-addressed ids (`<sha256>.<ext>`), enables cross-parquet / cross-dataset dedup |
| `images` | list[Image] | bytes embedded at HF push time; matches `image_ids` index-for-index |
| `messages` | list[struct] | OpenAI-style turns with `role` + structured `content` |
| `metadata` | struct | `{platform, task_type, split, others{...}}` |
Coordinate values in `messages` are normalized to `[0, 1000]` integers.
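To map these values back onto a concrete screenshot, one plausible conversion is sketched below. It assumes that 0 maps to the left or top edge and 1000 to the full width or height, and that a box is ordered `(x1, y1, x2, y2)`; neither convention is confirmed by this card, and the function names are hypothetical.

```python
from typing import Tuple

SCALE = 1000  # upper bound of the normalized coordinate range


def to_pixels(value: int, size: int, scale: int = SCALE) -> int:
    """Scale one normalized coordinate to pixels, rounding half up."""
    return (value * size + scale // 2) // scale


def point_to_pixels(point: Tuple[int, int], width: int, height: int) -> Tuple[int, int]:
    """Convert a normalized (x, y) point to pixel coordinates."""
    x, y = point
    return to_pixels(x, width), to_pixels(y, height)


def bbox_to_pixels(
    bbox: Tuple[int, int, int, int], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized box, assumed to be (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    return (
        to_pixels(x1, width),
        to_pixels(y1, height),
        to_pixels(x2, width),
        to_pixels(y2, height),
    )


# Example on a 1920x1080 screenshot.
center = point_to_pixels((500, 250), 1920, 1080)
box = bbox_to_pixels((100, 200, 300, 400), 1920, 1080)
```

Integer arithmetic is used so the result does not depend on floating-point rounding.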
## Layout
```
<platform>/<task_type>/<split>.parquet # single-variant cohort
<platform>/<task_type>/<split>/<variant>.parquet # multi-variant cohort
<platform>/<task_type>/<split>/shard-NNNNN-of-NNNNN.parquet # + sharded single-variant
<platform>/<task_type>/<split>/<variant>/shard-NNNNN-of-NNNNN.parquet # + sharded multi-variant
```
- `platform` ∈ {desktop, mobile, web}
- `task_type` directory uses a hyphen where the metadata value uses a colon: `grounding-action/` → `grounding:action`
- `split` ∈ {train, validation} — `validation` is an in-distribution held-out slice (never used in training); `test` is reserved for out-of-distribution benchmark datasets
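The layout rules above can be sketched as a small path parser. This is an illustration rather than part of cua-lite: the name `parse_cohort_path` is hypothetical, and converting only the first hyphen of the directory name into a colon is an assumption that holds for the task types listed in this card.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Optional, Union

SHARD_RE = re.compile(r"^shard-\d+-of-\d+$")
SUFFIX = ".parquet"


def parse_cohort_path(path: str) -> Dict[str, Union[str, bool, None]]:
    """Split a repo-relative parquet path into its layout components."""
    parts = PurePosixPath(path).parts
    if not path.endswith(SUFFIX) or not 3 <= len(parts) <= 5:
        raise ValueError(f"unexpected layout: {path!r}")

    platform, task_dir = parts[0], parts[1]
    rest = list(parts[2:])
    rest[-1] = rest[-1][: -len(SUFFIX)]  # drop the file extension

    split = rest[0]
    variant: Optional[str] = None
    sharded = False
    for name in rest[1:]:
        if SHARD_RE.match(name):
            sharded = True
        else:
            variant = name

    return {
        "platform": platform,
        # Directory hyphen -> metadata colon (first hyphen only, by assumption).
        "task_type": task_dir.replace("-", ":", 1),
        "split": split,
        "variant": variant,
        "sharded": sharded,
    }


single = parse_cohort_path("desktop/grounding-bbox/train.parquet")
multi = parse_cohort_path("desktop/understanding/train/layout.parquet")
sharded_multi = parse_cohort_path(
    "desktop/understanding/validation/icon_caption/shard-00000-of-00004.parquet"
)
```

The shard file name in the last example is made up to match the `shard-NNNNN-of-NNNNN` pattern; the actual shard counts in the repo are not stated here.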
## Stats
| platform | task_type | variant | train | validation |
|---|---|---|---:|---:|
| desktop | grounding:bbox | bbox | 1,966,349 | 2,000 |
| desktop | grounding:point | point | 178,297 | 2,000 |
| desktop | understanding | icon_caption | 379,462 | 2,000 |
| desktop | understanding | layout | 849,696 | 2,000 |
## Image storage
Images are content-addressed by SHA-256 and deduplicated within this repo.
The `images` column on HuggingFace embeds raw bytes so the Hub viewer
renders thumbnails and `datasets.load_dataset` works out of the box.
For local workflows (SFT export, cross-dataset dedup, split rebalancing),
run [`reverse.py`](https://github.com/cua-lite/cua-lite/tree/main/scripts/hf_upload)
on a cloned repo: it extracts each unique `image_id` once to a shared
`image_store/<hash[:2]>/<hash>.<ext>` and rewrites the parquets to drop
the `images` column, so rows reference images by hash id only. The shared
store is reusable across datasets — the same image in two repos lands in
one file.
- Total unique images: **431,898**
- Store size: **58.18 GB**
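The content-addressed naming described in this section can be illustrated with the standard library. This is not the code of `reverse.py`; it is a sketch that assumes the id is the lowercase hex SHA-256 digest of the raw image bytes followed by the file extension, and the function names are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def image_id_for(data: bytes, ext: str) -> str:
    """Build an id of the form <sha256>.<ext> from raw image bytes."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{digest}.{ext.lstrip('.')}"


def store_path(image_id: str, root: str = "image_store") -> PurePosixPath:
    """Map an id to image_store/<hash[:2]>/<hash>.<ext>."""
    digest = image_id.split(".", 1)[0]
    return PurePosixPath(root) / digest[:2] / image_id


# Identical bytes always map to the same file, which is what lets one
# shared store deduplicate images across parquets and datasets.
demo_id = image_id_for(b"abc", "png")
demo_path = store_path(demo_id)
same_path = store_path(image_id_for(b"abc", ".png"))
```

Sharding by the first two hex characters keeps any single directory from holding every file, which is a common reason for this kind of layout.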
## Notes
_(none)_
## License & citation
See the original dataset, xlangai/Jedi: https://huggingface.co/datasets/xlangai/Jedi
Provider: cua-lite



