cua-lite/Jedi

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/cua-lite/Jedi
Resource description:
---
license: other
tags:
- cua-lite
- gui
- sft
task_categories:
- image-text-to-text
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "*/*/train*parquet"
    - "*/*/train/*.parquet"
    - "*/*/train/*/*.parquet"
  - split: validation
    path:
    - "*/*/validation*parquet"
    - "*/*/validation/*.parquet"
    - "*/*/validation/*/*.parquet"
- config_name: desktop-grounding-bbox
  data_files:
  - split: train
    path:
    - "desktop/grounding-bbox/train*parquet"
    - "desktop/grounding-bbox/train/*.parquet"
    - "desktop/grounding-bbox/train/*/*.parquet"
  - split: validation
    path:
    - "desktop/grounding-bbox/validation*parquet"
    - "desktop/grounding-bbox/validation/*.parquet"
    - "desktop/grounding-bbox/validation/*/*.parquet"
- config_name: desktop-grounding-point
  data_files:
  - split: train
    path:
    - "desktop/grounding-point/train*parquet"
    - "desktop/grounding-point/train/*.parquet"
    - "desktop/grounding-point/train/*/*.parquet"
  - split: validation
    path:
    - "desktop/grounding-point/validation*parquet"
    - "desktop/grounding-point/validation/*.parquet"
    - "desktop/grounding-point/validation/*/*.parquet"
- config_name: desktop-understanding
  data_files:
  - split: train
    path:
    - "desktop/understanding/train*parquet"
    - "desktop/understanding/train/*.parquet"
    - "desktop/understanding/train/*/*.parquet"
  - split: validation
    path:
    - "desktop/understanding/validation*parquet"
    - "desktop/understanding/validation/*.parquet"
    - "desktop/understanding/validation/*/*.parquet"
---

# cua-lite/Jedi

cua-lite preprocessed version of Jedi (xlangai/Jedi). Desktop GUI data covering three task types: understanding (icon captioning, layout description), grounding:bbox (layout regions), grounding:point (icon centers).
## Origin

- [https://huggingface.co/datasets/xlangai/Jedi](https://huggingface.co/datasets/xlangai/Jedi)

## Load via `datasets`

```python
from datasets import load_dataset

# entire dataset
ds = load_dataset("cua-lite/Jedi")

# just one (platform, task_type) cohort
ds = load_dataset("cua-lite/Jedi", "desktop-grounding-bbox")
```

You can also filter by `metadata.platform` / `metadata.task_type` / `metadata.others.*` after loading; every row carries a rich `metadata` struct (see schema below).

## Schema

Each row has these columns:

| column | type | notes |
|---|---|---|
| `image_ids` | list[string] | content-addressed ids (`<sha256>.<ext>`), enables cross-parquet / cross-dataset dedup |
| `images` | list[Image] | bytes embedded at HF push time; matches `image_ids` index-for-index |
| `messages` | list[struct] | OpenAI-style turns with `role` + structured `content` |
| `metadata` | struct | `{platform, task_type, split, others{...}}` |

Coordinate values in `messages` are normalized to `[0, 1000]` integers.
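Because coordinates are stored on a `[0, 1000]` integer grid, they usually need to be mapped back to pixels before drawing or evaluating against a screenshot. The following is a minimal sketch, not part of the dataset tooling: the helper name `denormalize_point` is hypothetical, and it assumes 0 maps to the left/top edge and 1000 to the right/bottom edge of the image. The card does not say how coordinates are serialized inside `messages`, so extracting them from a message is left out.

```python
def denormalize_point(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Convert a point on the [0, 1000] grid to pixel coordinates.

    Assumption (not stated in the dataset card): 0 maps to the left/top
    edge and 1000 to the right/bottom edge. Results are clamped so they
    remain valid pixel indices.
    """
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        raise ValueError(f"expected values in [0, 1000], got ({x}, {y})")
    px = min(x * width // 1000, width - 1)
    py = min(y * height // 1000, height - 1)
    return px, py


# The centre of the grid on a 1920x1080 screenshot.
print(denormalize_point(500, 500, 1920, 1080))  # (960, 540)
```

A bounding box can be handled by applying the same function to each of its corners, once the corner ordering used in `messages` has been confirmed from the data.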
## Layout

```
<platform>/<task_type>/<split>.parquet                                   # single-variant cohort
<platform>/<task_type>/<split>/<variant>.parquet                         # multi-variant cohort
<platform>/<task_type>/<split>/shard-NNNNN-of-NNNNN.parquet              # + sharded single-variant
<platform>/<task_type>/<split>/<variant>/shard-NNNNN-of-NNNNN.parquet    # + sharded multi-variant
```

- `platform` ∈ {desktop, mobile, web}
- `task_type` directory uses a hyphen where the metadata value uses a colon: `grounding-action/` → `grounding:action`
- `split` ∈ {train, validation}: `validation` is an in-distribution held-out slice (never used in training); `test` is reserved for out-of-distribution benchmark datasets

## Stats

| platform | task_type | variant | train | validation |
|---|---|---|---:|---:|
| desktop | grounding:bbox | bbox | 1,966,349 | 2,000 |
| desktop | grounding:point | point | 178,297 | 2,000 |
| desktop | understanding | icon_caption | 379,462 | 2,000 |
| desktop | understanding | layout | 849,696 | 2,000 |

## Image storage

Images are content-addressed by SHA-256 and deduplicated within this repo. The `images` column on HuggingFace embeds raw bytes so the Hub viewer renders thumbnails and `datasets.load_dataset` works out of the box.

For local workflows (SFT export, cross-dataset dedup, split rebalancing), run [`reverse.py`](https://github.com/cua-lite/cua-lite/tree/main/scripts/hf_upload) on a cloned repo: it extracts each unique `image_id` once to a shared `image_store/<hash[:2]>/<hash>.<ext>` and rewrites the parquets to drop the `images` column, so rows reference images by hash id only. The shared store is reusable across datasets: the same image in two repos lands in one file.

- Total unique images: **431,898**
- Store size: **58.18 GB**

## Notes

_(none)_

## License & citation

See the original dataset (xlangai/Jedi): https://huggingface.co/datasets/xlangai/Jedi
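The path conventions described under Layout and Image storage can be sketched in a few lines of Python. This is an illustrative reading of the card, not the code in `reverse.py`: the helper names `task_type_from_dir`, `parse_parquet_path`, and `image_store_path` are hypothetical, and the sketch assumes that only the first hyphen in a `task_type` directory stands in for the colon and that shard files always start with `shard-`.

```python
from pathlib import PurePosixPath


def task_type_from_dir(dirname: str) -> str:
    """Map a task_type directory name to its metadata value.

    Assumption: only the first hyphen replaces the colon.
    """
    return dirname.replace("-", ":", 1)


def parse_parquet_path(path: str) -> dict:
    """Split a repo-relative parquet path into the layout's components."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or not parts[-1].endswith(".parquet"):
        raise ValueError(f"unexpected path: {path}")
    stem = parts[-1][: -len(".parquet")]
    middle = list(parts[2:-1])  # directories between task_type and the file
    shard = None
    if stem.startswith("shard-"):
        shard = stem  # sharded cohort: the file name is the shard id
    else:
        middle.append(stem)  # unsharded: the file name is split or variant
    if not 1 <= len(middle) <= 2:
        raise ValueError(f"unexpected path: {path}")
    return {
        "platform": parts[0],
        "task_type": task_type_from_dir(parts[1]),
        "split": middle[0],
        "variant": middle[1] if len(middle) == 2 else None,
        "shard": shard,
    }


def image_store_path(image_id: str, root: str = "image_store") -> str:
    """Location of an image in the shared store: <root>/<hash[:2]>/<hash>.<ext>."""
    return f"{root}/{image_id[:2]}/{image_id}"


info = parse_parquet_path("desktop/understanding/train/layout.parquet")
print(info["task_type"], info["split"], info["variant"])
```

Keeping the parsing separate from any file I/O makes it easy to check against a listing of the cloned repo before relying on it.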

Provider: cua-lite