magicgh/MT-Mind2Web

Name: magicgh/MT-Mind2Web
Creator: magicgh
Published: 2024-02-23 02:38:22
License: cc-by-4.0

Source: Hugging Face. Updated 2024-02-23; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/magicgh/MT-Mind2Web

Resource description:

---
license: cc-by-4.0
language:
- en
pretty_name: MT-Mind2Web
tags:
- web navigation
- conversation
---

# MT-Mind2Web Dataset

MT-Mind2Web is built from the single-turn interactions in [Mind2Web](https://huggingface.co/datasets/osunlp/Mind2Web), an expert-annotated web navigation dataset, which serve as guidance for constructing conversation sessions.

## Statistics

|                     | Train | Test-Task | Test-Website | Test-Subdomain |
|---------------------|-------|-----------|--------------|----------------|
| # Conversations     | 600   | 34        | 42           | 44             |
| # Turns             | 2,896 | 191       | 218          | 216            |
| Avg. # Turn/Conv.   | 4.83  | 5.62      | 5.19         | 4.91           |
| Avg. # Action/Turn  | 2.95  | 3.16      | 3.01         | 3.07           |
| Avg. # Element/Turn | 573.8 | 626.3     | 620.6        | 759.4          |
| Avg. Inst. Length   | 36.3  | 37.4      | 39.8         | 36.2           |
| Avg. HTML Length    | 169K  | 195K      | 138K         | 397K           |

## Dataset Structure

- "task_id" (str): unique id for each task
- "website" (str): website name
- "domain" (str): website domain
- "subdomain" (str): website subdomain
- "turns" (list[dict]): list of subtasks
  - "annotation_id" (str): unique id for each subtask
  - "confirmed_task" (str): subtask description
  - "action_reprs" (list[str]): human-readable string representation of the action sequence
  - "actions" (list[dict]): list of actions (steps) to complete the subtask
    - "action_uid" (str): unique id for each action (step)
    - "raw_html" (str): raw HTML of the page before the action is performed
    - "cleaned_html" (str): cleaned HTML of the page before the action is performed
    - "operation" (dict): operation to perform
      - "op" (str): operation type, one of CLICK, TYPE, SELECT
      - "original_op" (str): original operation type; contains the additional types HOVER and ENTER, which are mapped to CLICK; not used
      - "value" (str): optional value for the operation, e.g., text to type, option to select
    - "pos_candidates" (list[dict]): ground-truth elements. Only positive elements that exist in "cleaned_html" after preprocessing are included, so "pos_candidates" might be empty. The original labeled element can always be found in the "raw_html".
      - "tag" (str): tag of the element
      - "is_original_target" (bool): whether the element is the original target labeled by the annotator
      - "is_top_level_target" (bool): whether the element is a top-level target found by our algorithm. Please see the paper for more details.
      - "backend_node_id" (str): unique id for the element
      - "attributes" (str): serialized attributes of the element; use `json.loads` to convert back to a dict
    - "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing; has a similar structure to "pos_candidates"
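The "attributes" field is stored as a JSON string and "pos_candidates" may be empty, so code that reads target elements has to handle both cases. The following is a minimal sketch under stated assumptions: the helper name `target_elements` and the toy record are hypothetical, invented only to mirror the field layout described above, and are not part of the dataset or its tooling.

```python
import json


def target_elements(action):
    """Return (backend_node_id, tag, attributes) for each positive candidate.

    `target_elements` is a hypothetical helper, not part of the dataset tooling.
    An empty result means no labeled element survived HTML cleaning.
    """
    elements = []
    for cand in action.get("pos_candidates", []):
        # "attributes" is a serialized JSON object; decode it back to a dict.
        attrs = json.loads(cand["attributes"])
        elements.append((cand["backend_node_id"], cand["tag"], attrs))
    return elements


# Toy action shaped like the documented schema; every value is made up.
example_action = {
    "action_uid": "a-1",
    "operation": {"op": "CLICK", "original_op": "CLICK", "value": ""},
    "pos_candidates": [
        {
            "tag": "button",
            "is_original_target": True,
            "is_top_level_target": True,
            "backend_node_id": "136",
            "attributes": json.dumps({"id": "search", "class": "btn"}),
        }
    ],
    "neg_candidates": [],
}

print(target_elements(example_action))
print(target_elements({"pos_candidates": []}))
```

When the result is empty, the labeled element would have to be located in "raw_html" instead, since the card states it can always be found there.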

Provider:

magicgh

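Summary of the original information

MT-Mind2Web Dataset

The MT-Mind2Web dataset is built by using the single-turn interactions from Mind2Web, an expert-annotated web navigation dataset, as guidance for constructing conversation sessions.

Statistics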

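|                     | Train | Test-Task | Test-Website | Test-Subdomain |
|---------------------|-------|-----------|--------------|----------------|
| # Conversations     | 600   | 34        | 42           | 44             |
| # Turns             | 2,896 | 191       | 218          | 216            |
| Avg. # Turn/Conv.   | 4.83  | 5.62      | 5.19         | 4.91           |
| Avg. # Action/Turn  | 2.95  | 3.16      | 3.01         | 3.07           |
| Avg. # Element/Turn | 573.8 | 626.3     | 620.6        | 759.4          |
| Avg. Inst. Length   | 36.3  | 37.4      | 39.8         | 36.2           |
| Avg. HTML Length    | 169K  | 195K      | 138K         | 397K           |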
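The average number of turns per conversation follows directly from the first two rows of the table. The short check below recomputes it from the reported counts; the variable names are illustrative, and the other averages cannot be recomputed this way because the table does not give their underlying totals.

```python
# (conversations, turns) per split, copied from the statistics table.
counts = {
    "Train": (600, 2896),
    "Test-Task": (34, 191),
    "Test-Website": (42, 218),
    "Test-Subdomain": (44, 216),
}

# Average turns per conversation, rounded to two decimals as in the table.
avg_turns = {
    split: round(turns / convs, 2) for split, (convs, turns) in counts.items()
}

# Totals across all four splits.
total_conversations = sum(convs for convs, _ in counts.values())
total_turns = sum(turns for _, turns in counts.values())

for split, value in avg_turns.items():
    print(f"{split}: {value}")
print(total_conversations, total_turns)
```

The recomputed averages (4.83, 5.62, 5.19, 4.91) match the reported row, and the four splits together contain 720 conversations and 3,521 turns.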

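Dataset structure

"task_id" (str): unique ID for each task
"website" (str): website name
"domain" (str): website domain
"subdomain" (str): website subdomain
"turns" (list[dict]): list of subtasks
- "annotation_id" (str): unique ID for each subtask
- "confirmed_task" (str): subtask description
- "action_reprs" (list[str]): human-readable string representation of the action sequence
- "actions" (list[dict]): list of actions (steps) to complete the subtask
  - "action_uid" (str): unique ID for each action (step)
  - "raw_html" (str): raw HTML of the page before the action is performed
  - "cleaned_html" (str): cleaned HTML of the page before the action is performed
  - "operation" (dict): operation to perform
    - "op" (str): operation type, one of CLICK, TYPE, SELECT
    - "original_op" (str): original operation type; contains the additional types HOVER and ENTER, which are mapped to CLICK; not used
    - "value" (str): optional value for the operation, e.g., text to type, option to select
  - "pos_candidates" (list[dict]): ground-truth elements. Only positive elements that exist in "cleaned_html" after preprocessing are included, so "pos_candidates" might be empty. The original labeled element can always be found in "raw_html".
    - "tag" (str): tag of the element
    - "is_original_target" (bool): whether the element is the original target labeled by the annotator
    - "is_top_level_target" (bool): whether the element is a top-level target found by the authors' algorithm. See the paper for details.
    - "backend_node_id" (str): unique ID for the element
    - "attributes" (str): serialized attributes of the element; use json.loads to convert back to a dict
  - "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing; has a similar structure to "pos_candidates"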
