Multimodal-Mind2Web
Source: ModelScope community. Updated: 2026-05-09. Indexed: 2024-06-08.
Download link:
https://modelscope.cn/datasets/swift/Multimodal-Mind2Web
## Dataset Description
- **Homepage:** https://osu-nlp-group.github.io/SeeAct/
- **Repository:** https://github.com/OSU-NLP-Group/SeeAct
- **Paper:** https://arxiv.org/abs/2401.01614
- **Point of Contact:** [Boyuan Zheng](mailto:zheng.2372@osu.edu)
### Dataset Summary
Multimodal-Mind2Web is the multimodal version of [Mind2Web](https://osu-nlp-group.github.io/Mind2Web/), a dataset for developing and evaluating generalist web agents that can follow language instructions to complete complex tasks on any website. In this version, each HTML document is aligned with its corresponding webpage screenshot from the Mind2Web raw dump, which removes the inconvenience of loading images from the ~300GB Mind2Web Raw Dump.
## Dataset Structure
### Data Splits
- train: 7775 actions from 1009 tasks.
- test_task: 1339 actions from 177 tasks. Tasks from the same website are seen during training.
- test_website: 1019 actions from 142 tasks. Websites are not seen during training.
- test_domain: 4060 actions from 694 tasks. Entire domains are not seen during training.
The **_train_** set may include some screenshots that were not rendered properly because of rendering issues during Mind2Web annotation. The three **_test splits (test_task, test_website, test_domain)_** have undergone human verification to confirm element visibility and correct rendering for action prediction.
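The split sizes above can be tallied with a short script. This is only a minimal sketch: the `SPLITS` dictionary and the `actions_per_task` helper are illustrative names, not part of the dataset or its tooling, and the numbers are copied from the list above.

```python
# Split sizes copied from the list above: (actions, tasks).
SPLITS = {
    "train": (7775, 1009),
    "test_task": (1339, 177),
    "test_website": (1019, 142),
    "test_domain": (4060, 694),
}


def actions_per_task(split: str) -> float:
    """Average number of actions (steps) per task in a split."""
    actions, tasks = SPLITS[split]
    return actions / tasks


# Totals across all four splits.
total_actions = sum(actions for actions, _ in SPLITS.values())
total_tasks = sum(tasks for _, tasks in SPLITS.values())

if __name__ == "__main__":
    for name in SPLITS:
        print(f"{name}: {actions_per_task(name):.2f} actions per task")
    print(f"total: {total_actions} actions from {total_tasks} tasks")
```

Because each row of the dataset is a single action rather than a whole task, the action counts are the row counts to expect per split.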
### Data Fields
Each line in the dataset is an action consisting of a screenshot image, HTML text, and the other fields required for action prediction, for convenience at inference time.
- "annotation_id" (str): unique id for each task
- "website" (str): website name
- "domain" (str): website domain
- "subdomain" (str): website subdomain
- "confirmed_task" (str): task description
- **"screenshot" (str): path to the webpage screenshot image corresponding to the HTML.**
- "action_uid" (str): unique id for each action (step)
- "raw_html" (str): raw html of the page before the action is performed
- "cleaned_html" (str): cleaned html of the page before the action is performed
- "operation" (dict): operation to perform
  - "op" (str): operation type, one of CLICK, TYPE, SELECT
  - "original_op" (str): original operation type; additionally contains HOVER and ENTER, which are mapped to CLICK. Not used.
  - "value" (str): optional value for the operation, e.g., text to type, option to select
- "pos_candidates" (list[dict]): ground truth elements. Only positive elements that still exist in "cleaned_html" after our preprocessing are included, so "pos_candidates" might be empty. The original labeled element can always be found in "raw_html".
  - "tag" (str): tag of the element
  - "is_original_target" (bool): whether the element is the original target labeled by the annotator
  - "is_top_level_target" (bool): whether the element is a top-level target found by our algorithm; see the paper for details
  - "backend_node_id" (str): unique id for the element
  - "attributes" (str): serialized attributes of the element; use `json.loads` to convert back to a dict
- "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing; has a structure similar to "pos_candidates"
- "action_reprs" (list[str]): human readable string representation of the action sequence
- "target_action_index" (str): the index of the target action in the action sequence
- "target_action_reprs" (str): human readable string representation of the target action
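Two of the fields above need decoding before use: "attributes" is a JSON string, and "target_action_index" is stored as a string. The sketch below shows one way to handle both. The `sample_action` record is hypothetical (only the field names come from the list above; every value is made up), the helper names are illustrative, and the lookup assumes the index is zero-based, which the field description does not state.

```python
import json

# Hypothetical record: field names follow the list above, values are made up.
sample_action = {
    "annotation_id": "task-0001",
    "action_uid": "step-0002",
    "operation": {"op": "TYPE", "original_op": "TYPE", "value": "Columbus"},
    "pos_candidates": [
        {
            "tag": "input",
            "is_original_target": True,
            "is_top_level_target": True,
            "backend_node_id": "136",
            "attributes": json.dumps({"id": "search", "type": "text"}),
        }
    ],
    "action_reprs": [
        "[button] Search -> CLICK",
        "[input] City -> TYPE: Columbus",
    ],
    "target_action_index": "1",
    "target_action_reprs": "[input] City -> TYPE: Columbus",
}


def decode_candidates(candidates):
    """Return copies of the candidates with "attributes" parsed into dicts."""
    return [{**c, "attributes": json.loads(c["attributes"])} for c in candidates]


def target_action(action):
    """Look up the target action's representation via its string index.

    Assumes a zero-based index into "action_reprs".
    """
    return action["action_reprs"][int(action["target_action_index"])]


decoded = decode_candidates(sample_action["pos_candidates"])
# "pos_candidates" may be empty, so guard before indexing into it.
first_attrs = decoded[0]["attributes"] if decoded else None
```

The same `decode_candidates` helper applies to "neg_candidates", since the list above describes it as having a similar structure.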
### Disclaimer
This dataset was collected and released solely for research purposes, with the goal of making the web more accessible via language technologies. The authors are strongly against any potential harmful use of the data or technology to any party.
### Citation Information
```
@inproceedings{zheng2024seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=piecKJ2DlB},
}
@inproceedings{deng2023mindweb,
title={Mind2Web: Towards a Generalist Agent for the Web},
author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=kiYqbO3wqw}
}
```
Provider: maas
Created: 2024-06-06