Download link:

https://modelscope.cn/datasets/hkust-nlp/GUIMid



Resource description:

<div align="center">
<h1> Breaking the Data Barrier – Building GUI Agents Through Task Generalization </h1>
</div>

<div align="center">

[🐙 GitHub](https://github.com/hkust-nlp/GUIMid) | 📝 [Paper](https://arxiv.org/abs/2504.10127) | [🤗 Mid-training Data](https://huggingface.co/datasets/hkust-nlp/GUIMid/) | [🤗 Post-Training Data](https://huggingface.co/datasets/hkust-nlp/GUIMid/blob/main/GUI_trajectory.json)

</div>

<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63b76e716fc56e43c3c22ca8/6fepPX_FZRCiqHgypsBMD.png" width="60%" />
</div>

## TODO List

- [ ] Report and release a larger GUIMid covering more domains (expected 10 May)

## 1. Data Overview

The mid-training data is composed of 11 diverse domains: 7 vision-and-language domains and 4 language-only domains.

The performance of each domain when used as mid-training data is as follows:

| Domains | Observation | WebArena (PR) | WebArena (SR) | AndroidWorld (SR) |
|----------------------------------|-------------------|--------------:|--------------:|------------------:|
| **GUI Post-Training Only** | Image | 26.3 | 6.2 | 9.0 |
| **Public Baselines** | | | | |
| GPT-4o-2024-11-20 | Image | 36.9 | 15.6 | 11.7 |
| OS-Genesis-7B | Image + Accessibility Tree | - | - | 17.4 |
| AGUVIS-72B | Image | - | - | 26.1 |
| Claude3-Haiku | Accessibility Tree | 26.8 | 12.7 | - |
| Llama3-70b | Accessibility Tree | 35.6 | 12.6 | - |
| Gemini1.5-Flash | Accessibility Tree | 32.4 | 11.1 | - |
| **Vision-and-Language Modality** | | | | |
| Chart/Document QA | Image | 24.6 | 6.2 | 15.3 |
| Non-GUI Perception | Image | 28.7 | 7.6 | 14.0 |
| GUI Perception | Image | 27.4 | 7.1 | 14.0 |
| Web Screenshot2Code | Image | 28.0 | 6.6 | 9.9 |
| Non-GUI Agents | Image | 30.8 | 8.5 | 13.5 |
| Multi-modal Math ✓ | Image | 30.4 | 8.5 | 15.3 |
| Multi-round Visual Conversation | Image | 30.0 | 9.0 | 12.6 |
| **Language Modality** | | | | |
| MathInstruct ✓ | Image | 31.9 | 10.9 | 14.4 |
| Olympiad Math ✓ | Image | 31.5 | 8.5 | 13.1 |
| CodeI/O ✓ | Image | 29.2 | 9.0 | 14.9 |
| Web Knowledge Base | Image | 31.3 | 9.5 | 9.0 |
| **Domain Combination (domains with ✓)** | | | | |
| **GUIMid** | Image | **34.3** | **9.5** | **21.2** |

To help researchers quickly understand the data of each task, we provide **dataset examples** in the GitHub repository: [🤗 GUIMid](https://github.com/hkust-nlp/GUIMid#).

## 2. Download Link

You can download the JSON files with:

```bash
huggingface-cli download --repo-type dataset --resume-download hkust-nlp/GUIMid --local-dir hkust-nlp/GUIMid
```

Then extract the images with:

```bash
tar -zxvf xxx.tar.gz
```

**If you have network problems, you can try [HF-Mirror](https://hf-mirror.com/).**

## 3. Data Files Introduction

### Post-Training Data

Our post-training dataset includes multimodal data (text and images) from the mobile and web domains. Text data is in `GUI_trajectory.json`, and images are in `traj.tar.gz`.

### Mid-training data for each domain

We provide **mid-training data** covering **7 vision-language domains** and **4 language-only domains**:

**Vision-Language Domains**

- `Chart_Document_QA.json`
- `GUI_Perception.json`
- `Multi-modal_Math.json`
- `Multi-round_Visual_Conversation.json`
- `Non-GUI_Agents.json`
- `Web_Screenshot2Code.json`
- `Non-GUI_Perception.json`

**Language-Only Domains**

- `CodeIO.json`
- `MathInstruct.json`
- `Olympiad_Math.json`
- `Web_Knowledge_Base.json`

*(Image data for some domains will be released shortly.)*

### GUIMid Data

We provide the GUIMid data. Text data is in `GUIMid.json`, and images are in `mavis.tar.gz`.

## Citation

If you find this repository helpful, feel free to cite our paper:

```bibtex
@article{zhang2025breaking,
  title={Breaking the Data Barrier--Building GUI Agents Through Task Generalization},
  author={Zhang, Junlei and Ding, Zichen and Ma, Chang and Chen, Zijie and Sun, Qiushi and Lan, Zhenzhong and He, Junxian},
  journal={arXiv preprint arXiv:2504.10127},
  year={2025}
}
```
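For readers who prefer to script the extraction step from Section 2, the following is a minimal Python sketch that unpacks one of the image archives and inspects one of the JSON files. It uses only the standard library. The helper names `extract_archive` and `summarize_json` and the `images` output folder are illustrative assumptions, not part of the GUIMid repository, and the sketch deliberately makes no assumption about the JSON schema.

```python
import json
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names.

    Hypothetical helper; it plays the same role as running `tar -zxf`.
    Only use it on archives from a source you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        tar.extractall(dest)
        return tar.getnames()


def summarize_json(json_path):
    """Load a JSON file and return (top-level type name, number of entries).

    Hypothetical helper; no field names are assumed, so it works for any
    valid JSON file.
    """
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    # Matches the --local-dir used in the download command.
    root = Path("hkust-nlp/GUIMid")
    archive = root / "traj.tar.gz"
    if archive.exists():
        # "images" is an assumed output folder name.
        members = extract_archive(archive, root / "images")
        print(f"extracted {len(members)} entries")
    trajectory = root / "GUI_trajectory.json"
    if trajectory.exists():
        print(summarize_json(trajectory))
```

The script does nothing if the files have not been downloaded yet, so it is safe to run before or after the download step.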


Application scenarios: