GUIMid
Source: ModelScope community. Updated 2026-01-06; indexed 2025-04-26.
Download link: https://modelscope.cn/datasets/hkust-nlp/GUIMid
Description:
<div align="center">
<h1> Breaking the Data Barrier – Building GUI Agents Through Task Generalization </h1>
</div>
<div align="center">
[🐙 GitHub](https://github.com/hkust-nlp/GUIMid) | [📝 Paper](https://arxiv.org/abs/2504.10127) | [🤗 Mid-Training Data](https://huggingface.co/datasets/hkust-nlp/GUIMid/) | [🤗 Post-Training Data](https://huggingface.co/datasets/hkust-nlp/GUIMid/blob/main/GUI_trajectory.json)
</div>
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63b76e716fc56e43c3c22ca8/6fepPX_FZRCiqHgypsBMD.png" width="60%" />
</div>
## TODO List
- [ ] Report and release a larger GUIMid covering more domains (expected 10 May)
## 1. Data Overview
The mid-training data spans 11 diverse domains: 7 vision-and-language domains and 4 language-only domains.
The table below reports performance when each domain is used as mid-training data:
| Domains | Observation | WebArena (PR) | WebArena (SR) | AndroidWorld (SR) |
|----------------------------------|-------------------|--------------:|--------------:|------------------:|
| **GUI Post-Training Only** | Image | 26.3 | 6.2 | 9.0 |
| **Public Baselines** | | | | |
| GPT-4o-2024-11-20 | Image | 36.9 | 15.6 | 11.7 |
| OS-Genesis-7B | Image + Accessibility Tree | - | - | 17.4 |
| AGUVIS-72B | Image | - | - | 26.1 |
| Claude3-Haiku | Accessibility Tree| 26.8 | 12.7 | - |
| Llama3-70b | Accessibility Tree| 35.6 | 12.6 | - |
| Gemini1.5-Flash | Accessibility Tree| 32.4 | 11.1 | - |
| **Vision-and-Language Modality** | | | | |
| Chart/Document QA | Image | 24.6 | 6.2 | 15.3 |
| Non-GUI Perception | Image | 28.7 | 7.6 | 14.0 |
| GUI Perception | Image | 27.4 | 7.1 | 14.0 |
| Web Screenshot2Code | Image | 28.0 | 6.6 | 9.9 |
| Non-GUI Agents | Image | 30.8 | 8.5 | 13.5 |
| Multi-modal Math ✓ | Image | 30.4 | 8.5 | 15.3 |
| Multi-round Visual Conversation | Image | 30.0 | 9.0 | 12.6 |
| **Language Modality** | | | | |
| MathInstruct ✓ | Image | 31.9 | 10.9 | 14.4 |
| Olympiad Math ✓ | Image | 31.5 | 8.5 | 13.1 |
| CodeI/O ✓ | Image | 29.2 | 9.0 | 14.9 |
| Web Knowledge Base | Image | 31.3 | 9.5 | 9.0 |
| **Domain Combination (domains with ✓)** | | | | |
| **GUIMid** | Image | **34.3** | **9.5** | **21.2** |
To help researchers quickly understand the data for each task, we provide **dataset examples** in the GitHub repository: [🐙 GUIMid](https://github.com/hkust-nlp/GUIMid#).
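
One way to read the table is as gains over the "GUI Post-Training Only" baseline. The sketch below recomputes those differences for the ✓ domains and for GUIMid, using only numbers copied from the table. The names `BASELINE`, `SCORES`, and `gains` are illustrative and not part of the released code.

```python
# Scores copied from the table: (WebArena PR, WebArena SR, AndroidWorld SR).
BASELINE = (26.3, 6.2, 9.0)  # "GUI Post-Training Only" row

SCORES = {
    "Multi-modal Math": (30.4, 8.5, 15.3),
    "MathInstruct": (31.9, 10.9, 14.4),
    "Olympiad Math": (31.5, 8.5, 13.1),
    "CodeI/O": (29.2, 9.0, 14.9),
    "GUIMid": (34.3, 9.5, 21.2),
}


def gains(scores, baseline=BASELINE):
    """Return the absolute difference to the baseline for each metric, in points."""
    return tuple(round(s - b, 1) for s, b in zip(scores, baseline))


# Print one line per domain: name, then the three signed differences.
for name, scores in SCORES.items():
    pr, sr, aw = gains(scores)
    print(f"{name:<18} PR {pr:+5.1f}  SR {sr:+5.1f}  AndroidWorld SR {aw:+5.1f}")
```

These are plain differences in reported points; the sketch makes no claim about statistical significance.
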
## 2. Download Link
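Download the data files with:
```bash
huggingface-cli download --repo-type dataset --resume-download hkust-nlp/GUIMid --local-dir hkust-nlp/GUIMid
```
Then extract the images with:
```bash
# Replace xxx.tar.gz with the archive you downloaded, e.g. traj.tar.gz or mavis.tar.gz
tar -zxvf xxx.tar.gz
```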
**If you have network problems, try [HF-Mirror](https://hf-mirror.com/).**
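
The same two steps can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the archives should be unpacked next to the JSON files; the archive layout is not documented here, so that target directory is an assumption. The helper names `download_guimid` and `extract_archives` are hypothetical.

```python
import tarfile
from pathlib import Path


def download_guimid(local_dir: str = "hkust-nlp/GUIMid") -> str:
    """Download the dataset repository (needs network access and huggingface_hub)."""
    # Imported lazily so extract_archives() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="hkust-nlp/GUIMid",
        repo_type="dataset",  # the repository is hosted as a dataset, not a model
        local_dir=local_dir,
    )


def extract_archives(root: str) -> list[str]:
    """Extract every *.tar.gz found directly under `root` into `root`.

    Returns the names of the archives that were extracted.
    """
    root_path = Path(root)
    extracted = []
    for archive in sorted(root_path.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Assumption: unpack in place, beside the JSON files.
            tar.extractall(path=root_path)
        extracted.append(archive.name)
    return extracted
```

A typical call would be `extract_archives(download_guimid())`. Only extract archives from a source you trust.
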
## 3. Data Files Introduction
### Post-Training Data
Our post-training dataset includes multimodal data (text and images) from mobile and web domains. Text data is in `GUI_trajectory.json`, and images are in `traj.tar.gz`.
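
To get a first look at `GUI_trajectory.json`, a small helper like the one below may be enough. It is a sketch that assumes the file holds a top-level JSON list of records; the record schema is not described in this README, so the helper does not rely on any field names. `peek_json` is a hypothetical name.

```python
import json
from pathlib import Path


def peek_json(path, n=1):
    """Return (record_count, first n records) for a JSON file with a top-level list."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        # Assumption violated: report the actual top-level type instead of guessing.
        raise ValueError(f"expected a top-level JSON list, got {type(data).__name__}")
    return len(data), data[:n]


# Example usage (path assumes the download command from section 2):
# count, head = peek_json("hkust-nlp/GUIMid/GUI_trajectory.json")
# print(count, head)
```

The helper loads the whole file into memory, which is simple but may be slow for large files.
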
### Mid-Training Data for Each Domain
We provide **mid-training data** covering **7 vision-language domains** and **4 language-only domains**:
**Vision-Language Domains**
- `Chart_Document_QA.json`
- `GUI_Perception.json`
- `Multi-modal_Math.json`
- `Multi-round_Visual_Conversation.json`
- `Non-GUI_Agents.json`
- `Web_Screenshot2Code.json`
- `Non-GUI_Perception.json`
**Language-Only Domains**
- `CodeIO.json`
- `MathInstruct.json`
- `Olympiad_Math.json`
- `Web_Knowledge_Base.json`
*(Image data for some domains will be released shortly.)*
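
After downloading, it can be useful to confirm that all eleven domain files listed above are present. The sketch below checks a local directory against that list; the file names are copied from this section, while `missing_files` and the two list names are illustrative.

```python
from pathlib import Path

# File names copied from the lists above.
VISION_LANGUAGE_FILES = [
    "Chart_Document_QA.json",
    "GUI_Perception.json",
    "Multi-modal_Math.json",
    "Multi-round_Visual_Conversation.json",
    "Non-GUI_Agents.json",
    "Web_Screenshot2Code.json",
    "Non-GUI_Perception.json",
]
LANGUAGE_ONLY_FILES = [
    "CodeIO.json",
    "MathInstruct.json",
    "Olympiad_Math.json",
    "Web_Knowledge_Base.json",
]


def missing_files(root):
    """Return the expected mid-training files that are not present under `root`."""
    root_path = Path(root)
    expected = VISION_LANGUAGE_FILES + LANGUAGE_ONLY_FILES
    return [name for name in expected if not (root_path / name).is_file()]


# Example usage (directory assumes the download command from section 2):
# print(missing_files("hkust-nlp/GUIMid"))
```

An empty list means every domain file was found; it says nothing about the image archives, some of which are not yet released.
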
### GUIMid Data
We also provide the combined GUIMid dataset. Text data is in `GUIMid.json`, and images are in `mavis.tar.gz`.
## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{zhang2025breaking,
title={Breaking the Data Barrier--Building GUI Agents Through Task Generalization},
author={Zhang, Junlei and Ding, Zichen and Ma, Chang and Chen, Zijie and Sun, Qiushi and Lan, Zhenzhong and He, Junxian},
journal={arXiv preprint arXiv:2504.10127},
year={2025}
}
```
Provider: maas
Created: 2025-04-22



