MultiUI

Name: MultiUI
Creator: maas
Published: 2025-12-04 16:17:41
License: No description available

ModelScope community · Updated 2025-12-04 · Indexed 2024-10-26

Download link:

https://modelscope.cn/datasets/AI-ModelScope/MultiUI

Resource description:

# MultiUI

#### Dataset for the paper: [Harnessing Webpage UIs for Text-Rich Visual Understanding](https://arxiv.org/abs/2410.13824)

🌐 [Homepage](https://neulab.github.io/MultiUI/) | 🐍 [GitHub](https://github.com/neulab/multiui) | 📖 [arXiv](https://arxiv.org/abs/2410.13824)

## Introduction

We introduce **MultiUI**, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on **MultiUI** not only excel in web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset Mind2Web, but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation.

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65403d8781a8731a1c09a584/vk7yT4Y7ydBOHM6BojmlI.mp4"></video>

## Dataset Construction

We construct MultiUI with a pipeline of four main stages:

1. **Website Scraping**
2. **Website Curation** using Llama-3-70b-Instruct
3. **Task Extraction** utilizing Llama-3-70b-Instruct, GPT-4o mini, and rule-based approaches to generate Web UI tasks across three categories:
   - Visual understanding and reasoning
   - Text recognition
   - Grounding
4. **Task Sample Generation**: for each task, generate task samples by applying diverse instruction templates paraphrased by GPT-4o.

We ultimately curated a dataset of **7.3 million** web UI-related samples in the form of VQA, covering nine tasks across perception, comprehension, grounding, and reasoning capabilities.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65403d8781a8731a1c09a584/DiJEm6mEr8smN2QkGEKLm.png)

### Task samples

To enhance multimodal models' perception, comprehension, grounding, and reasoning capabilities, we have designed a diverse set of nine tasks, emphasizing the critical abilities for text-rich visual understanding scenarios.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65403d8781a8731a1c09a584/BkxEuNOcyR19IPhu7yZKK.png)

## Files Description

As described in our paper, we developed a two-stage training pipeline. We randomly split the entire dataset into two parts:

- `stage1_data.json` holds the first part, which accounts for 95% of the total data.
- `stage2_data_to_be_combined_with_general_data.json` holds the second part, which accounts for the remaining 5%.

During stage 2, we combine the stage-2 data with [LLaVA-NeXT](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) data to further finetune models trained on stage-1 data.

## Dataset Disclaimer

The MultiUI dataset is released for open-source use by the research and developer community. The data is largely sourced from publicly available web content or generated by large language models (LLMs). We constructed this dataset using links from Hugging Face's FineWeb dataset, which is based on a Common Crawl dump, representing publicly accessible data from the web.

This dataset is mostly intended for research purposes and may contain material with inaccuracies, biases, or other unintended issues. We do not intentionally include any copyrighted material, and any resemblance to such content is unintentional.

If you have any concerns regarding specific data or believe that any content should be removed, please contact us, and we will review the request and take appropriate action.

## Contact

* Junpeng Liu: jpliu@link.cuhk.edu.hk
* Xiang Yue: xyue2@andrew.cmu.edu

## Citation

If you find this work helpful, please cite our paper:

````
@misc{liu2024harnessingwebpageuistextrich,
  title={Harnessing Webpage UIs for Text-Rich Visual Understanding},
  author={Junpeng Liu and Tianyue Ou and Yifan Song and Yuxiao Qu and Wai Lam and Chenyan Xiong and Wenhu Chen and Graham Neubig and Xiang Yue},
  year={2024},
  eprint={2410.13824},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.13824},
}
````
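As a rough illustration of the random 95%/5% split described under Files Description, the sketch below partitions a list of records into a stage-1 and a stage-2 subset. It is not the authors' script: the function name `split_two_stage`, the fixed seed, and the toy records are assumptions made for this example, and the real JSON schema is not shown here.

```python
import random


def split_two_stage(samples, stage1_fraction=0.95, seed=0):
    """Randomly partition `samples` into (stage1, stage2).

    Hypothetical helper for illustration; the seed and the rounding rule
    are assumptions, not details taken from the MultiUI release.
    """
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    cut = round(len(shuffled) * stage1_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy records standing in for VQA samples (assumed shape).
samples = [{"id": i} for i in range(1000)]
stage1, stage2 = split_two_stage(samples)
print(len(stage1), len(stage2))  # 950 50
```

Applied to the full 7.3 million samples, the same proportions would give roughly 6.9 million stage-1 samples and about 0.37 million stage-2 samples, assuming the stated percentages are exact.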


Provider:

maas

Created:

2024-10-23
