NAIL-Group/ClawBench

Name: NAIL-Group/ClawBench
Creator: NAIL-Group
Published: 2026-04-10 12:41:07
License: Apache-2.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/NAIL-Group/ClawBench

Description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- web-agents
- benchmark
- evaluation
- browser-automation
pretty_name: ClawBench
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: test
    path: data/train-00000-of-00001.parquet
arxiv: "2604.08523"
---

# ClawBench

**Can AI Agents Complete Everyday Online Tasks?**

ClawBench evaluates AI agents on **153 everyday tasks** (such as booking flights, ordering groceries, submitting job applications) across **144 live websites**. We capture **5 layers of behavioral data** (session replay, screenshots, HTTP traffic, agent reasoning traces, and browser actions), collect human ground-truth for every task, and score with an agentic evaluator that provides step-level traceable diagnostics.

| | |
|---|---|
| **Paper** | [arXiv:2604.08523](https://arxiv.org/abs/2604.08523) |
| **Website** | [claw-bench.com](https://claw-bench.com) |
| **Code** | [github.com/reacher-z/ClawBench](https://github.com/reacher-z/ClawBench) |

## Dataset Structure

### Columns

| Column | Type | Description |
|--------|------|-------------|
| `task_id` | int | Unique task identifier |
| `instruction` | string | Task prompt sent to the agent |
| `metaclass` | string | High-level category (21 categories) |
| `class` | string | Fine-grained sub-category |
| `platform` | string | Target platform (144 unique platforms) |
| `sites` | list[string] | Domains involved in the task |
| `eval_schema` | string (JSON) | Request interception configuration |
| `time_limit` | int | Maximum time in minutes |
| `extra_info` | string (JSON) | Paths to additional context files |
| `shared_info` | string | Path to shared user profile |

### Additional Files

```
shared/
  alex_green_personal_info.json    # Shared dummy user profile used across all tasks
extra_info/
  004/grocery_list.json            # Task-specific context (32 tasks have extra info)
  007/meal_plan.json
  043/pet_info.json
  ...
```

- **`shared/alex_green_personal_info.json`**: A comprehensive dummy user persona (Alex Green) including personal details, address, work history, education, financial information, and preferences. All tasks share this identity.
- **`extra_info/`**: Task-specific supplementary files referenced by the `extra_info` column. 32 of 153 tasks include additional context such as grocery lists, job links, meeting details, etc.

### eval_schema

The `eval_schema` field configures the **request interceptor**, a mechanism that blocks the final HTTP request matching the specified URL pattern and method, preventing irreversible actions (checkout, form submission, etc.) from reaching the server. This allows safe evaluation on live websites.

```json
{
  "url_pattern": "taskrabbit\\.(com|ca)/(api/v\\d+/jobs|book/\\d+/confirm)",
  "method": "POST"
}
```

## Task Categories (metaclass)

| Category | Tasks | Example Platforms |
|----------|-------|-------------------|
| daily-life | 21 | Uber Eats, Instacart, Zillow |
| entertainment-hobbies | 15 | Goodreads, Eventbrite, Fandango |
| creation-init | 13 | ClickUp, Typeform, Ghost |
| office-secretary-tasks | 9 | Trello, Calendly, Purelymail |
| rating-voting | 10 | TripAdvisor, Glassdoor, Yelp |
| education-learning | 9 | Coursera, LeetCode, Blinkist |
| travel | 9 | Google Flights, Hipcamp, Airbnb |
| beauty-personal-care | 9 | TaskRabbit, Booksy, Soko Glam |
| pet-animal-care | 8 | Rover, Petfinder, Chewy |
| job-search-hr | 8 | Indeed, Greenhouse, ZipRecruiter |
| academia-research | 5 | Zotero, Overleaf, Google Scholar |
| and 10 more... | | |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("NAIL-Group/ClawBench", split="test")
print(ds[0])
```

## Citation

```bibtex
@article{zhang2026clawbench,
  title={ClawBench: Can AI Agents Complete Everyday Online Tasks?},
  author={Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
  journal={arXiv preprint arXiv:2604.08523},
  year={2026}
}
```
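
## Example: matching a request against `eval_schema`

The `eval_schema` column is described above as a JSON string holding a URL pattern and an HTTP method. As a rough illustration of how such a rule could be applied, the sketch below decodes the example schema from the card and checks a few requests against it. The helper names `parse_eval_schema` and `should_block`, the sample URLs, the unanchored `re.search` match, and the case-insensitive method comparison are all assumptions made for illustration; the actual ClawBench interceptor may work differently.

```python
import json
import re


def parse_eval_schema(raw: str) -> dict:
    """Decode the JSON string stored in the `eval_schema` column."""
    return json.loads(raw)


def should_block(schema: dict, method: str, url: str) -> bool:
    """Return True when a request matches the schema's method and URL pattern.

    Assumption: the pattern is applied with an unanchored `re.search` and the
    method is compared case-insensitively. The real interceptor may differ.
    """
    if method.upper() != schema["method"].upper():
        return False
    return re.search(schema["url_pattern"], url) is not None


# The example schema from the dataset card, written as the raw JSON string
# that the column would hold.
raw_schema = (
    r'{"url_pattern": "taskrabbit\\.(com|ca)/(api/v\\d+/jobs|book/\\d+/confirm)",'
    r' "method": "POST"}'
)
schema = parse_eval_schema(raw_schema)

# Hypothetical requests, invented for this example only.
sample_requests = [
    ("GET", "https://www.taskrabbit.com/api/v3/jobs"),
    ("POST", "https://www.taskrabbit.com/api/v3/jobs"),
    ("POST", "https://www.taskrabbit.ca/book/12345/confirm"),
    ("POST", "https://www.taskrabbit.com/search"),
]

blocked = [should_block(schema, method, url) for method, url in sample_requests]
for (method, url), is_blocked in zip(sample_requests, blocked):
    print(f"{method:4} {url} -> {'blocked' if is_blocked else 'allowed'}")
```

Under these assumptions only the two `POST` requests that hit the job-creation or booking-confirmation paths are flagged, while browsing and search traffic passes through. This is consistent with the card's statement that only the final, irreversible request is blocked.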

Provided by:

NAIL-Group
