
magicgh/MT-Mind2Web

Hugging Face · Updated 2024-02-23 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/magicgh/MT-Mind2Web
Resource description:
---
license: cc-by-4.0
language:
- en
pretty_name: MT-Mind2Web
tags:
- web navigation
- conversation
---

# MT-Mind2Web Dataset

MT-Mind2Web uses the single-turn interactions from [Mind2Web](https://huggingface.co/datasets/osunlp/Mind2Web), an expert-annotated web navigation dataset, as guidance for constructing conversation sessions.

## Statistics

|                     | Train | Test-Task | Test-Website | Test-Subdomain |
|---------------------|-------|-----------|--------------|----------------|
| # Conversations     | 600   | 34        | 42           | 44             |
| # Turns             | 2,896 | 191       | 218          | 216            |
| Avg. # Turn/Conv.   | 4.83  | 5.62      | 5.19         | 4.91           |
| Avg. # Action/Turn  | 2.95  | 3.16      | 3.01         | 3.07           |
| Avg. # Element/Turn | 573.8 | 626.3     | 620.6        | 759.4          |
| Avg. Inst. Length   | 36.3  | 37.4      | 39.8         | 36.2           |
| Avg. HTML Length    | 169K  | 195K      | 138K         | 397K           |

## Dataset Structure

- "task_id" (str): unique id for each task
- "website" (str): website name
- "domain" (str): website domain
- "subdomain" (str): website subdomain
- "turns" (list[dict]): list of subtasks
  - "annotation_id" (str): unique id for each subtask
  - "confirmed_task" (str): subtask description
  - "action_reprs" (list[str]): human-readable string representation of the action sequence
  - "actions" (list[dict]): list of actions (steps) to complete the subtask
    - "action_uid" (str): unique id for each action (step)
    - "raw_html" (str): raw HTML of the page before the action is performed
    - "cleaned_html" (str): cleaned HTML of the page before the action is performed
    - "operation" (dict): operation to perform
      - "op" (str): operation type, one of CLICK, TYPE, SELECT
      - "original_op" (str): original operation type; it also contains HOVER and ENTER, which are mapped to CLICK. Not used.
      - "value" (str): optional value for the operation, e.g., text to type, option to select
    - "pos_candidates" (list[dict]): ground-truth elements. Only positive elements that still exist in "cleaned_html" after our preprocessing are included, so "pos_candidates" might be empty. The original labeled element can always be found in "raw_html".
      - "tag" (str): tag of the element
      - "is_original_target" (bool): whether the element is the original target labeled by the annotator
      - "is_top_level_target" (bool): whether the element is a top-level target found by our algorithm. Please see the paper for more details.
      - "backend_node_id" (str): unique id for the element
      - "attributes" (str): serialized attributes of the element; use `json.loads` to convert back to a dict
    - "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing; has a similar structure to "pos_candidates"
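The nesting described above (task, then turns, then actions, then candidates) can be walked with a few loops. The following is only a sketch: `sample_record` is a hypothetical, hand-written record that follows the documented field names, its values are invented, and `iter_positive_candidates` is an illustrative helper rather than part of the dataset tooling. It also shows the `json.loads` step needed for the serialized "attributes" field.

```python
import json

# Hypothetical record that follows the documented schema.
# All values are invented for illustration and are not taken from the dataset.
sample_record = {
    "task_id": "task-0",
    "website": "example",
    "domain": "Travel",
    "subdomain": "Airlines",
    "turns": [
        {
            "annotation_id": "ann-0",
            "confirmed_task": "Search for flights to Boston",
            "action_reprs": ["[textbox] Destination -> TYPE: Boston"],
            "actions": [
                {
                    "action_uid": "act-0",
                    "raw_html": "<html><body><input name='destination'></body></html>",
                    "cleaned_html": "<html><input name='destination'></html>",
                    "operation": {"op": "TYPE", "original_op": "TYPE", "value": "Boston"},
                    "pos_candidates": [
                        {
                            "tag": "input",
                            "is_original_target": True,
                            "is_top_level_target": True,
                            "backend_node_id": "136",
                            # "attributes" is stored as a JSON string, not a dict.
                            "attributes": json.dumps({"type": "text", "name": "destination"}),
                        }
                    ],
                    "neg_candidates": [],
                }
            ],
        }
    ],
}


def iter_positive_candidates(record):
    """Yield (annotation_id, op, value, tag, attributes) for every positive candidate."""
    for turn in record["turns"]:
        for action in turn["actions"]:
            operation = action["operation"]
            for candidate in action["pos_candidates"]:
                # Decode the serialized attributes back into a dict.
                attributes = json.loads(candidate["attributes"])
                yield (
                    turn["annotation_id"],
                    operation["op"],
                    operation.get("value"),  # "value" is optional
                    candidate["tag"],
                    attributes,
                )


targets = list(iter_positive_candidates(sample_record))
for annotation_id, op, value, tag, attributes in targets:
    print(annotation_id, op, value, tag, attributes)
```

The same loops apply to "neg_candidates", which the field list describes as having a similar structure.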
Provider:
magicgh
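Summary of original information

MT-Mind2Web Dataset

The MT-Mind2Web dataset is built by using the single-turn interactions from Mind2Web as guidance for constructing conversation sessions. Mind2Web is an expert-annotated web navigation dataset.

Statistics

|                     | Train | Test-Task | Test-Website | Test-Subdomain |
|---------------------|-------|-----------|--------------|----------------|
| # Conversations     | 600   | 34        | 42           | 44             |
| # Turns             | 2,896 | 191       | 218          | 216            |
| Avg. # Turn/Conv.   | 4.83  | 5.62      | 5.19         | 4.91           |
| Avg. # Action/Turn  | 2.95  | 3.16      | 3.01         | 3.07           |
| Avg. # Element/Turn | 573.8 | 626.3     | 620.6        | 759.4          |
| Avg. Inst. Length   | 36.3  | 37.4      | 39.8         | 36.2           |
| Avg. HTML Length    | 169K  | 195K      | 138K         | 397K           |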
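
The average number of turns per conversation follows directly from the first two rows of the statistics: it is the turn count divided by the conversation count. A quick check, with the counts copied from the table and the dictionary layout chosen only for illustration:

```python
# Conversation and turn counts copied from the statistics table.
SPLITS = {
    "Train": {"conversations": 600, "turns": 2896, "reported_avg": 4.83},
    "Test-Task": {"conversations": 34, "turns": 191, "reported_avg": 5.62},
    "Test-Website": {"conversations": 42, "turns": 218, "reported_avg": 5.19},
    "Test-Subdomain": {"conversations": 44, "turns": 216, "reported_avg": 4.91},
}


def avg_turns(split):
    """Average number of turns per conversation for one split."""
    return split["turns"] / split["conversations"]


for name, split in SPLITS.items():
    computed = avg_turns(split)
    # Reported averages are rounded to two decimals, so allow half a unit
    # in the last reported place.
    assert abs(computed - split["reported_avg"]) < 0.005, name
    print(f"{name}: {computed:.2f}")

total_conversations = sum(s["conversations"] for s in SPLITS.values())
total_turns = sum(s["turns"] for s in SPLITS.values())
print(total_conversations, total_turns)
```

Summing the columns the same way gives 720 conversations and 3,521 turns across the four splits.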

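Dataset structure

  • "task_id" (str): unique ID for each task
  • "website" (str): website name
  • "domain" (str): website domain
  • "subdomain" (str): website subdomain
  • "turns" (list[dict]): list of subtasks
    • "annotation_id" (str): unique ID for each subtask
    • "confirmed_task" (str): subtask description
    • "action_reprs" (list[str]): human-readable string representation of the action sequence
    • "actions" (list[dict]): list of actions (steps) to complete the subtask
      • "action_uid" (str): unique ID for each action (step)
      • "raw_html" (str): raw HTML of the page before the action is performed
      • "cleaned_html" (str): cleaned HTML of the page before the action is performed
      • "operation" (dict): operation to perform
        • "op" (str): operation type, one of CLICK, TYPE, SELECT
        • "original_op" (str): original operation type; it also contains HOVER and ENTER, which are mapped to CLICK. Not used.
        • "value" (str): optional value for the operation, e.g., text to type, option to select
      • "pos_candidates" (list[dict]): ground-truth elements. Only positive elements that still exist in "cleaned_html" after preprocessing are included, so "pos_candidates" might be empty. The original labeled element can always be found in "raw_html".
        • "tag" (str): tag of the element
        • "is_original_target" (bool): whether the element is the original target labeled by the annotator
        • "is_top_level_target" (bool): whether the element is a top-level target found by our algorithm. See the paper for details.
        • "backend_node_id" (str): unique ID for the element
        • "attributes" (str): serialized attributes of the element; use json.loads to convert back to a dict
      • "neg_candidates" (list[dict]): other candidate elements in the page after preprocessing; has a similar structure to "pos_candidates"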
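
Two caveats in the field list are easy to miss: "original_op" can hold HOVER or ENTER, which correspond to CLICK in "op", and "pos_candidates" can be empty after preprocessing. The sketch below illustrates both. The helper names and the trimmed sample actions are hypothetical, and whether to skip actions without a positive candidate is a choice left to the user, not something the dataset card prescribes.

```python
CANONICAL_OPS = {"CLICK", "TYPE", "SELECT"}
MAPPED_TO_CLICK = {"HOVER", "ENTER"}


def canonical_op(original_op):
    """Map an "original_op" value onto the three documented "op" types."""
    if original_op in MAPPED_TO_CLICK:
        return "CLICK"
    if original_op in CANONICAL_OPS:
        return original_op
    raise ValueError(f"unexpected operation type: {original_op!r}")


def split_by_pos_candidates(actions):
    """Partition actions by whether any positive candidate survived preprocessing."""
    with_target, without_target = [], []
    for action in actions:
        if action["pos_candidates"]:
            with_target.append(action)
        else:
            without_target.append(action)
    return with_target, without_target


# Hypothetical, heavily trimmed actions; only the fields used here are present.
actions = [
    {
        "action_uid": "a1",
        "operation": {"op": "CLICK", "original_op": "HOVER"},
        "pos_candidates": [{"tag": "a"}],
    },
    {
        "action_uid": "a2",
        "operation": {"op": "TYPE", "original_op": "TYPE", "value": "Boston"},
        "pos_candidates": [],  # target element was dropped from "cleaned_html"
    },
]

kept, skipped = split_by_pos_candidates(actions)
for action in actions:
    operation = action["operation"]
    # The stored "op" should agree with the mapping applied to "original_op".
    assert canonical_op(operation["original_op"]) == operation["op"]

print([a["action_uid"] for a in kept], [a["action_uid"] for a in skipped])
```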