magicgh/MT-Mind2Web
Source: Hugging Face. Updated 2024-02-23; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/magicgh/MT-Mind2Web
Resource description:
---
license: cc-by-4.0
language:
- en
pretty_name: MT-Mind2Web
tags:
- web navigation
- conversation
---
# MT-Mind2Web Dataset
MT-Mind2Web is built from the single-turn interactions in [Mind2Web](https://huggingface.co/datasets/osunlp/Mind2Web), an expert-annotated web navigation dataset, which serve as guidance for constructing conversation sessions.
## Statistics
| | Train | Test-Task | Test-Website | Test-Subdomain |
|--------------------|-------|-----------|--------------|----------------|
| # Conversations | 600 | 34 | 42 | 44 |
| # Turns | 2,896 | 191 | 218 | 216 |
| Avg. # Turn/Conv. | 4.83 | 5.62 | 5.19 | 4.91 |
| Avg. # Action/Turn | 2.95 | 3.16 | 3.01 | 3.07 |
| Avg. # Element/Turn| 573.8 | 626.3 | 620.6 | 759.4 |
| Avg. Inst. Length | 36.3 | 37.4 | 39.8 | 36.2 |
| Avg. HTML Length | 169K | 195K | 138K | 397K |
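The "Avg. # Turn/Conv." row can be cross-checked against the two rows above it, since it is the number of turns divided by the number of conversations. The following is a minimal sketch of that arithmetic; the counts are copied from the table, and the function and variable names are illustrative rather than part of the dataset.

```python
# (conversations, turns) per split, copied from the statistics table.
splits = {
    "Train": (600, 2896),
    "Test-Task": (34, 191),
    "Test-Website": (42, 218),
    "Test-Subdomain": (44, 216),
}


def avg_turns_per_conversation(conversations: int, turns: int) -> float:
    """Average number of turns per conversation, rounded to two decimals."""
    return round(turns / conversations, 2)


averages = {
    name: avg_turns_per_conversation(conversations, turns)
    for name, (conversations, turns) in splits.items()
}

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

The computed values match the reported 4.83, 5.62, 5.19, and 4.91.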
## Dataset Structure
- "task_id" (str): unique id for each task
- "website" (str): website name
- "domain" (str): website domain
- "subdomain" (str): website subdomain
- "turns" (list[dict]): list of subtasks
    - "annotation_id" (str): unique id for each subtask
    - "confirmed_task" (str): subtask description
    - "action_reprs" (list[str]): human-readable string representation of the action sequence
    - "actions" (list[dict]): list of actions (steps) to complete the subtask
        - "action_uid" (str): unique id for each action (step)
        - "raw_html" (str): raw HTML of the page before the action is performed
        - "cleaned_html" (str): cleaned HTML of the page before the action is performed
        - "operation" (dict): operation to perform
            - "op" (str): operation type, one of CLICK, TYPE, SELECT
            - "original_op" (str): original operation type; contains the additional types HOVER and ENTER, which are mapped to CLICK. Not used.
            - "value" (str): optional value for the operation, e.g., text to type or option to select
        - "pos_candidates" (list[dict]): ground-truth elements. Only positive elements that still exist in "cleaned_html" after preprocessing are included, so "pos_candidates" might be empty. The original labeled element can always be found in "raw_html".
            - "tag" (str): tag of the element
            - "is_original_target" (bool): whether the element is the original target labeled by the annotator
            - "is_top_level_target" (bool): whether the element is a top-level target found by our algorithm. Please see the paper for more details.
            - "backend_node_id" (str): unique id for the element
            - "attributes" (str): serialized attributes of the element; use `json.loads` to convert back to a dict
        - "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing; has a structure similar to "pos_candidates"
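The nesting described above (task, then turns, then actions, then candidates) can be traversed as in the sketch below. The record is a hypothetical, hand-written example that only mirrors the documented field names; its ids, text, and HTML are placeholders, not real dataset content, and `iter_targets` is an illustrative helper. The sketch also covers the two details called out in the field list: "pos_candidates" may be empty, and "attributes" is a JSON string that needs `json.loads`.

```python
import json

# Hypothetical record mirroring the documented schema (all values are made up).
record = {
    "task_id": "task-0",
    "website": "example",
    "domain": "Travel",
    "subdomain": "Airlines",
    "turns": [
        {
            "annotation_id": "ann-0",
            "confirmed_task": "Search for a one-way flight.",
            "action_reprs": ["[button] Search -> CLICK"],
            "actions": [
                {
                    "action_uid": "act-0",
                    "raw_html": "<html><button id='search'>Search</button></html>",
                    "cleaned_html": "<html><button id='search'>Search</button></html>",
                    "operation": {"op": "CLICK", "original_op": "CLICK", "value": ""},
                    "pos_candidates": [
                        {
                            "tag": "button",
                            "is_original_target": True,
                            "is_top_level_target": True,
                            "backend_node_id": "136",
                            # Stored as a serialized JSON string, as documented.
                            "attributes": json.dumps({"id": "search"}),
                        }
                    ],
                    "neg_candidates": [],
                },
                {
                    "action_uid": "act-1",
                    "raw_html": "<html><input name='city'></html>",
                    "cleaned_html": "<html></html>",
                    "operation": {"op": "TYPE", "original_op": "TYPE", "value": "Boston"},
                    # Empty: the target did not survive preprocessing.
                    "pos_candidates": [],
                    "neg_candidates": [],
                },
            ],
        }
    ],
}


def iter_targets(task):
    """Yield (annotation_id, action_uid, op, value, attrs) for every action.

    attrs is the decoded attribute dict of the first positive candidate,
    or None when "pos_candidates" is empty.
    """
    for turn in task["turns"]:
        for action in turn["actions"]:
            operation = action["operation"]
            candidates = action["pos_candidates"]
            attrs = json.loads(candidates[0]["attributes"]) if candidates else None
            yield (
                turn["annotation_id"],
                action["action_uid"],
                operation["op"],
                operation.get("value", ""),
                attrs,
            )


rows = list(iter_targets(record))
for row in rows:
    print(row)
```

Taking only the first positive candidate is a simplification for the example; code that evaluates element selection would likely need to consider every entry in "pos_candidates".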
Provider:
magicgh



