
Mind2Web

Source: ModelScope community (魔搭社区). Updated 2026-01-09. Indexed 2025-07-05.
Download link:
https://modelscope.cn/datasets/osunlp/Mind2Web
Resource description:
# Dataset Card for Mind2Web

## Dataset Description

- **Homepage:** https://osu-nlp-group.github.io/Mind2Web/
- **Repository:** https://github.com/OSU-NLP-Group/Mind2Web
- **Paper:** https://arxiv.org/abs/2306.06070
- **Point of Contact:** [Xiang Deng](mailto:deng.595@osu.edu)

### Dataset Summary

Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or cover only a limited set of websites and tasks, and are thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents:

1. diverse domains, websites, and tasks,
2. use of real-world websites instead of simulated and simplified ones, and
3. a broad spectrum of user interaction patterns.

## Dataset Structure

### Data Fields

- "annotation_id" (str): unique id for each task
- "website" (str): website name
- "domain" (str): website domain
- "subdomain" (str): website subdomain
- "confirmed_task" (str): task description
- "action_reprs" (list[str]): human-readable string representation of the action sequence
- "actions" (list[dict]): list of actions (steps) to complete the task
  - "action_uid" (str): unique id for each action (step)
  - "raw_html" (str): raw HTML of the page before the action is performed
  - "cleaned_html" (str): cleaned HTML of the page before the action is performed
  - "operation" (dict): operation to perform
    - "op" (str): operation type, one of CLICK, TYPE, SELECT
    - "original_op" (str): original operation type; contains the additional HOVER and ENTER types, which are mapped to CLICK; not used
    - "value" (str): optional value for the operation, e.g., text to type, option to select
  - "pos_candidates" (list[dict]): ground truth elements. Here we only include positive elements that exist in "cleaned_html" after our preprocessing, so "pos_candidates" might be empty. The original labeled element can always be found in the "raw_html".
    - "tag" (str): tag of the element
    - "is_original_target" (bool): whether the element is the original target labeled by the annotator
    - "is_top_level_target" (bool): whether the element is a top-level target found by our algorithm. Please see the paper for more details.
    - "backend_node_id" (str): unique id for the element
    - "attributes" (str): serialized attributes of the element; use `json.loads` to convert back to a dict
  - "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing; has a similar structure to "pos_candidates"

### Data Splits

- train: 1,009 instances
- test: (To prevent potential data leakage, please check our [repo](https://github.com/OSU-NLP-Group/Mind2Web) for information on obtaining the test set.)
  - Cross Task: 252 instances, tasks from the same website are seen during training
  - Cross Website: 177 instances, websites are not seen during training
  - Cross Domain: 912 instances, entire domains are not seen during training

### Licensing Information

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

### Disclaimer

This dataset was collected and released solely for research purposes, with the goal of making the web more accessible via language technologies. The authors are strongly against any potential harmful use of the data or technology to any party.

### Citation Information

```
@misc{deng2023mind2web,
  title={Mind2Web: Towards a Generalist Agent for the Web},
  author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
  year={2023},
  eprint={2306.06070},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
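
### Example: Reading a Record

The nested layout listed under Data Fields can be easier to follow in code. The following is a minimal sketch, not the authors' tooling: the `sample` record is hand-written and hypothetical (its values, including the `action_reprs` string, are illustrative and not taken from the dataset), and `summarize_actions` is a helper name introduced here. It assumes only the field names and types documented in this card, and shows how to decode "attributes" with `json.loads` while handling an empty "pos_candidates" list.

```python
import json

# Hypothetical record mirroring the documented fields; values are illustrative only.
sample = {
    "annotation_id": "example-task-0",
    "website": "example",
    "domain": "Travel",
    "subdomain": "Airlines",
    "confirmed_task": "Search for a one-way flight.",
    "action_reprs": ["[button] Search -> CLICK"],
    "actions": [
        {
            "action_uid": "example-action-0",
            "raw_html": '<html><button id="search">Search</button></html>',
            "cleaned_html": '<html><button id="search">Search</button></html>',
            "operation": {"op": "CLICK", "original_op": "CLICK", "value": ""},
            "pos_candidates": [
                {
                    "tag": "button",
                    "is_original_target": True,
                    "is_top_level_target": True,
                    "backend_node_id": "136",
                    # Attributes are stored as a serialized JSON string.
                    "attributes": json.dumps({"id": "search"}),
                }
            ],
            "neg_candidates": [],
        }
    ],
}


def summarize_actions(task):
    """Return one (op, value, target_tag, target_attrs) tuple per action."""
    steps = []
    for action in task["actions"]:
        operation = action["operation"]
        positives = action["pos_candidates"]
        if positives:
            target = positives[0]
            tag = target["tag"]
            attrs = json.loads(target["attributes"])  # str -> dict
        else:
            # The labeled element may be missing from cleaned_html after preprocessing.
            tag, attrs = None, {}
        steps.append((operation["op"], operation.get("value", ""), tag, attrs))
    return steps


if __name__ == "__main__":
    for step in summarize_actions(sample):
        print(step)
```

Taking the first positive candidate is a simplification for the sketch; when several positives are present, "is_original_target" and "is_top_level_target" can be used to choose among them.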

Provider:
maas
Created:
2025-07-04
Collected summary
Dataset introduction
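Background and challenges
Background overview
Mind2Web is a dataset for developing generalist web agents. It contains more than 2,000 tasks covering 137 real-world websites across 31 domains, and supports completing complex tasks by following language instructions. The dataset offers diverse tasks, real website environments, and a broad range of user interaction patterns, making it suitable for training and evaluating generalist web agents.
The summary above was collected and generated by 遇见数据集.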