CUA-HandCrafted Browser Benchmark (NeurIPS 2026 E&D)

Name: CUA-HandCrafted Browser Benchmark (NeurIPS 2026 E&D)
Creator: Zenodo
Published: 2026-05-05 19:14:59
License: No description available

Source: DataCite Commons. Updated 2026-05-05; indexed 2026-05-07.

Download link:

https://zenodo.org/doi/10.5281/zenodo.20034379

Description:

CUA-HandCrafted is a public benchmark of 793 episodes for adversarial evaluation of computer-using agents (CUAs). It comprises 24 multi-step web tasks across 8 self-hosted HTML/CSS/JS sites (HR Portal, Project Dashboard, CRM/HelpDesk, Banking, Shopping, Email, Forum, Settings). Attacks are drawn from 56 templates spanning 9 attack families and 5 depth levels. The families are denial_of_service, multi_step, unauthorized_action, data_exfiltration, goal_hijacking, credential_phishing, social_engineering, and authority_impersonation, plus literature-informed Phase 9 reproductions of the RL-Hammer, WASP, TRAP, and MUZZLE techniques. Attacks are delivered through 5 injection channels (hidden_text, popup, visible_text, dom_modify, help_text) and evaluated under 4 system-prompt configurations (L0_bare, L1_helpful, L2_default, L3_hardened). Each episode is a per-trajectory JSON log recording the action sequence, the canary detection outcome, the BU/UuA/ASR/Safety metrics, and the pinned provider and model version. The release also bundles 158 episodes from a SkillBench coding-agent cross-domain reference, which support the domain-conditioned safety result reported in the accompanying paper. Headline finding: a multi-step attack success rate (ASR) of 0/140 on Claude Sonnet 4.6 and GPT-5.4 in the browser domain (Clopper-Pearson upper bound 2.6%), yet the same model weights reach an ASR of up to 100% under hand-crafted skill injection on a coding agent. A companion Croissant 1.0 metadata file is provided as a separate download.
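
The Clopper-Pearson bound quoted in the headline finding can be checked with a short calculation. The sketch below is illustrative and not taken from the dataset or the paper: the description does not state the confidence level, so the two-sided 95% interval (alpha = 0.05) is an assumption, and the function names are hypothetical. It solves for the upper limit by bisection on the exact binomial CDF, using only the Python standard library.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float:
    """Upper limit of the two-sided (1 - alpha) Clopper-Pearson interval.

    ASSUMPTION: alpha = 0.05 (two-sided 95%); the source does not state the level.
    The upper limit is the p at which P(X <= k | p) equals alpha / 2.
    """
    if k == n:
        return 1.0
    target = alpha / 2
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection; the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2


# 0 successful attacks in 140 multi-step episodes, as in the description.
ub = clopper_pearson_upper(0, 140)
print(f"{ub:.1%}")  # prints 2.6%
```

For zero observed successes the limit also has a closed form, 1 - (alpha/2)^(1/n), which gives the same value for n = 140. Under this assumed level the result matches the 2.6% figure in the description.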

Provider:

Zenodo

Created:

2026-05-05
