
CUA-HandCrafted Browser Benchmark (NeurIPS 2026 E&D)

Source: DataCite Commons. Updated 2026-05-05. Indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20034379
Description:
CUA-HandCrafted is a public benchmark of 793 episodes for adversarial evaluation of computer-using agents (CUAs). It contains 24 multi-step web tasks across 8 self-hosted HTML/CSS/JS sites (HR Portal, Project Dashboard, CRM/HelpDesk, Banking, Shopping, Email, Forum, Settings); 56 attack templates spanning 9 attack families (denial_of_service, multi_step, unauthorized_action, data_exfiltration, goal_hijacking, credential_phishing, social_engineering, authority_impersonation, plus literature-informed Phase 9 reproductions of RL-Hammer, WASP, TRAP, and MUZZLE techniques) and 5 depth levels; 5 injection channels (hidden_text, popup, visible_text, dom_modify, help_text); and 4 system-prompt configurations (L0_bare, L1_helpful, L2_default, L3_hardened). Each episode is a per-trajectory JSON log that records the action sequence, the canary detection outcome, the BU/UuA/ASR/Safety metrics, and the pinned provider and model version. The release also bundles 158 episodes from a SkillBench coding-agent cross-domain reference, which supports the domain-conditioned safety result reported in the accompanying paper. Headline finding: multi-step ASR is 0/140 on Claude Sonnet 4.6 and GPT-5.4 in the browser domain (Clopper-Pearson upper bound 2.6%), but the same model weights reach an ASR of up to 100% under hand-crafted skill injection on a coding agent. Companion Croissant 1.0 metadata is provided as a separate download.
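The Clopper-Pearson upper bound quoted for the 0/140 result can be checked by hand. The sketch below is illustrative and not taken from the dataset's own tooling: it assumes a two-sided 95% interval (the description does not state the confidence level) and uses the closed form that applies when zero successes are observed. The function name `clopper_pearson_upper_zero` is hypothetical.

```python
def clopper_pearson_upper_zero(n: int, confidence: float = 0.95) -> float:
    """Exact (Clopper-Pearson) upper bound on a proportion when 0 of n trials succeed.

    With zero observed successes, the upper limit p_u of the two-sided interval
    solves (1 - p_u) ** n = alpha / 2, so p_u = 1 - (alpha / 2) ** (1 / n).
    The two-sided 95% default is an assumption, not stated in the source.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


# 0 successful attacks out of 140 multi-step episodes, as in the headline finding.
upper = clopper_pearson_upper_zero(140)
print(f"{upper:.1%}")  # prints 2.6%
```

Under these assumptions the bound comes out at about 2.6%, which agrees with the figure in the description. A one-sided 95% bound on the same data would be smaller (about 2.1%), so the quoted value is consistent with the two-sided convention.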
Provider:
Zenodo
Created:
2026-05-05