osunlp/Mind2Web

Name: osunlp/Mind2Web
Creator: osunlp
Published: 2025-10-19 19:15:03
License: no description provided

Hugging Face | Updated 2025-10-19 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/osunlp/Mind2Web


Resource description:

---
license: cc-by-4.0
language:
- en
tags:
- Web Agent
size_categories:
- 1K<n<10K
---

# Dataset Card for Mind2Web

## Dataset Description

- **Homepage:** https://osu-nlp-group.github.io/Mind2Web/
- **Repository:** https://github.com/OSU-NLP-Group/Mind2Web
- **Paper:** https://arxiv.org/abs/2306.06070
- **Point of Contact:** [Xiang Deng](mailto:deng.595@osu.edu)

### Dataset Summary

Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or cover only a limited set of websites and tasks, and are thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1. diverse domains, websites, and tasks, 2. use of real-world websites instead of simulated and simplified ones, and 3. a broad spectrum of user interaction patterns.

## Dataset Structure

### Data Fields

- "annotation_id" (str): unique id for each task
- "website" (str): website name
- "domain" (str): website domain
- "subdomain" (str): website subdomain
- "confirmed_task" (str): task description
- "action_reprs" (list[str]): human-readable string representation of the action sequence
- "actions" (list[dict]): list of actions (steps) to complete the task
  - "action_uid" (str): unique id for each action (step)
  - "raw_html" (str): raw HTML of the page before the action is performed
  - "cleaned_html" (str): cleaned HTML of the page before the action is performed
  - "operation" (dict): operation to perform
    - "op" (str): operation type, one of CLICK, TYPE, SELECT
    - "original_op" (str): original operation type; contains the additional types HOVER and ENTER, which are mapped to CLICK; not used
    - "value" (str): optional value for the operation, e.g., text to type, option to select
  - "pos_candidates" (list[dict]): ground truth elements. Only positive elements that exist in "cleaned_html" after preprocessing are included, so "pos_candidates" might be empty. The original labeled element can always be found in "raw_html".
    - "tag" (str): tag of the element
    - "is_original_target" (bool): whether the element is the original target labeled by the annotator
    - "is_top_level_target" (bool): whether the element is a top-level target found by our algorithm. Please see the paper for more details.
    - "backend_node_id" (str): unique id for the element
    - "attributes" (str): serialized attributes of the element; use `json.loads` to convert back to a dict
  - "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing; has a structure similar to "pos_candidates"

### Data Splits

- train: 1,009 instances
- test: (To prevent potential data leakage, please check our [repo](https://github.com/OSU-NLP-Group/Mind2Web) for information on obtaining the test set.)
  - Cross Task: 252 instances, tasks from the same website are seen during training
  - Cross Website: 177 instances, websites are not seen during training
  - Cross Domain: 912 instances, entire domains are not seen during training

### Licensing Information

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

### Disclaimer

This dataset was collected and released solely for research purposes, with the goal of making the web more accessible via language technologies. The authors are strongly against any potential harmful use of the data or technology to any party.

### Citation Information

```
@misc{deng2023mind2web,
  title={Mind2Web: Towards a Generalist Agent for the Web},
  author={Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samuel Stevens and Boshi Wang and Huan Sun and Yu Su},
  year={2023},
  eprint={2306.06070},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```


Provided by:

osunlp

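Summary of Original Information

Dataset Overview

Dataset name: Mind2Web

Purpose: developing and evaluating generalist web agents that can follow language instructions to complete complex tasks on any website.

Key features:

Over 2,000 open-ended tasks covering 137 websites and 31 domains.
Real-world websites rather than simulated or simplified ones.
A broad spectrum of user interaction patterns.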

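Dataset Structure

Data Fields

"annotation_id" (str): unique ID of the task.
"website" (str): website name.
"domain" (str): website domain.
"subdomain" (str): website subdomain.
"confirmed_task" (str): task description.
"action_reprs" (list[str]): human-readable string representation of the action sequence.
"actions" (list[dict]): list of actions (steps) that complete the task.
- "action_uid" (str): unique ID of the action.
- "raw_html" (str): raw HTML of the page before the action is performed.
- "cleaned_html" (str): cleaned HTML of the page before the action is performed.
- "operation" (dict): the operation to perform.
  - "op" (str): operation type, one of CLICK, TYPE, SELECT.
  - "original_op" (str): original operation type.
  - "value" (str): optional value for the operation.
- "pos_candidates" (list[dict]): ground-truth elements.
  - "tag" (str): element tag.
  - "is_original_target" (bool): whether the element is the original target.
  - "is_top_level_target" (bool): whether the element is a top-level target.
  - "backend_node_id" (str): unique ID of the element.
  - "attributes" (str): serialized string of the element's attributes.
- "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing.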
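As a rough illustration of the nested layout above, the following Python sketch walks one record and decodes the serialized "attributes" string with `json.loads`. The `sample` record is hypothetical and hand-written for illustration; only the field names come from the dataset card, and a real record carries full page HTML and many more candidates.

```python
import json

# Hypothetical, hand-written record that mimics the documented field layout.
# All values are placeholders, not real dataset content.
sample = {
    "annotation_id": "example-0001",
    "website": "example",
    "domain": "Travel",
    "subdomain": "Airlines",
    "confirmed_task": "Search for a one-way flight.",
    "action_reprs": ["[button] Search -> CLICK"],
    "actions": [
        {
            "action_uid": "step-0001",
            "raw_html": "<html>...</html>",
            "cleaned_html": "<html>...</html>",
            "operation": {"op": "CLICK", "original_op": "CLICK", "value": ""},
            "pos_candidates": [
                {
                    "tag": "button",
                    "is_original_target": True,
                    "is_top_level_target": True,
                    "backend_node_id": "136",
                    "attributes": json.dumps({"type": "submit", "class": "search"}),
                }
            ],
            "neg_candidates": [],
        }
    ],
}


def describe_actions(record):
    """Return one summary dict per action, with candidate attributes decoded."""
    summaries = []
    for action in record["actions"]:
        targets = [
            {
                "tag": cand["tag"],
                "backend_node_id": cand["backend_node_id"],
                # "attributes" is stored as a JSON string, so decode it.
                "attributes": json.loads(cand["attributes"]),
            }
            for cand in action["pos_candidates"]  # may be an empty list
        ]
        summaries.append(
            {
                "action_uid": action["action_uid"],
                "op": action["operation"]["op"],
                "value": action["operation"].get("value", ""),
                "targets": targets,
            }
        )
    return summaries


summaries = describe_actions(sample)
print(summaries[0]["op"], summaries[0]["targets"][0]["attributes"])
```

Because "pos_candidates" can be empty after preprocessing, code that consumes `targets` should handle an empty list rather than index into it unconditionally.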

Data Splits

Training set: 1,009 instances.
Test set:
- Cross Task: 252 instances.
- Cross Website: 177 instances.
- Cross Domain: 912 instances.
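The split sizes listed above can be tallied to check them against the "over 2,000 tasks" figure. This is a small arithmetic sketch; the dictionary keys are illustrative labels chosen here, not necessarily the split names used in the repository.

```python
# Split sizes as listed above; the key names are illustrative assumptions.
split_sizes = {
    "train": 1009,
    "test_cross_task": 252,
    "test_cross_website": 177,
    "test_cross_domain": 912,
}

# Sum the three test splits, then add the training split.
test_total = sum(n for name, n in split_sizes.items() if name.startswith("test_"))
total = split_sizes["train"] + test_total

print(f"test instances:  {test_total}")
print(f"all instances:   {total}")
for name, n in split_sizes.items():
    print(f"{name:>20}: {n:5d} ({n / total:.1%})")
```

The three test splits together hold 1,341 instances, which is more than the 1,009 training instances, and the overall total is 2,350.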

License Information

License: Creative Commons Attribution 4.0 International License.

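Collected Summary

Dataset Introduction

Construction

Mind2Web was built as a resource for developing and evaluating generalist web agents. It collects over 2,000 open-ended tasks from 137 websites spanning 31 domains, which gives it broad task diversity. Each task comes with a crowdsourced action sequence that records how the task is completed on a real-world website. The dataset also records the HTML of the page before each action, along with the type and value of each operation.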

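Usage

Researchers can read fields such as 'annotation_id', 'website', and 'domain' to obtain the details of each task. The 'actions' field gives the concrete steps needed to complete the task, including the operation type, its value, and the associated HTML. This information can be used to train and evaluate generalist web agents.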
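One possible way to turn a record into per-step training examples is sketched below. It assumes the Hugging Face `datasets` library is installed and that "action_reprs" holds one string per entry in "actions", in the same order. The helper names `load_train_split` and `iter_steps` and the `toy` record are hypothetical and introduced only for this example.

```python
def load_train_split():
    """Load the training split from the Hugging Face Hub.

    Requires the `datasets` package and network access; the import is
    done lazily so that the rest of this sketch runs offline.
    """
    from datasets import load_dataset

    return load_dataset("osunlp/Mind2Web", split="train")


def iter_steps(record):
    """Yield one example per action: task, action history, and target operation."""
    for i, action in enumerate(record["actions"]):
        yield {
            "task": record["confirmed_task"],
            # Assumes action_reprs[i] describes actions[i].
            "previous_actions": record["action_reprs"][:i],
            "cleaned_html": action["cleaned_html"],
            "op": action["operation"]["op"],
            "value": action["operation"]["value"],
            "target_ids": [c["backend_node_id"] for c in action["pos_candidates"]],
        }


# Hypothetical two-step record, trimmed to the fields used by iter_steps.
toy = {
    "confirmed_task": "Search for shoes.",
    "action_reprs": ["[textbox] Search -> TYPE: shoes", "[button] Go -> CLICK"],
    "actions": [
        {
            "cleaned_html": "<html>...</html>",
            "operation": {"op": "TYPE", "value": "shoes"},
            "pos_candidates": [{"backend_node_id": "12"}],
        },
        {
            "cleaned_html": "<html>...</html>",
            "operation": {"op": "CLICK", "value": ""},
            "pos_candidates": [],
        },
    ],
}

steps = list(iter_steps(toy))
for step in steps:
    print(step["op"], step["value"], step["target_ids"], len(step["previous_actions"]))
```

With network access, `for record in load_train_split(): ...` would feed real records into `iter_steps`; the toy record only exercises the logic offline.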

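Background and Challenges

Background Overview

Mind2Web was created in 2023 by the Ohio State University NLP group (OSU-NLP Group) to advance the development and evaluation of generalist web agents. Its core research question is how an agent can follow language instructions to complete complex tasks on any website. Existing datasets are mostly limited to simulated websites or to a narrow set of websites and tasks, and so cannot meet the needs of generalist web agents. By collecting over 2,000 open-ended tasks from 137 websites across 31 domains, together with crowdsourced action sequences, Mind2Web provides the diverse domains, real-world websites, and broad range of user interaction patterns needed to build such agents.

Current Challenges

Building Mind2Web involved several challenges. First, tasks had to be collected and annotated from many real websites while remaining diverse, complex, and representative. Second, because the websites are real, construction had to cope with complex HTML structure and user interaction patterns while keeping the action sequences accurate and executable. Third, the splits had to avoid potential data leakage, particularly across the Cross Task, Cross Website, and Cross Domain test sets, so that evaluation remains fair and valid. These challenges reflect both the complexity of building the dataset and the difficulty of applying generalist web agents in practice.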

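Common Scenarios

Classic Use Cases

The classic use of Mind2Web is developing and evaluating generalist web agents that can complete complex tasks on any website by following language instructions. The dataset supplies the diverse tasks, real website environments, and broad range of user interaction patterns that such agents require.

Academic Problems Addressed

Mind2Web addresses the limited task and website coverage of earlier web agent datasets. With over 2,000 open-ended tasks across 31 domains and 137 websites, it gives researchers a broad and realistic testbed for generalist web agents.

Practical Applications

In practice, Mind2Web can be used to develop intelligent assistants, automated testing tools, and user behavior analysis systems. Such applications rely on modeling the interaction behavior of real users in order to carry out tasks on websites.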

Recent Research on the Dataset