FailureScope Adversarial Corpus (NeurIPS 2026 E&D)

Source: DataCite Commons. Updated 2026-05-05; indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20034376
Description:
The FailureScope adversarial corpus contains 630 multi-turn agent traces spanning three attack families (split-action, cognitive-load, authority-impersonation) and three frontier models (Claude Sonnet 4.6, GPT-5.4, DeepSeek-v3.2), all run against a sandboxed code-execution environment. Each record contains the full agent trajectory, per-step refusal/compliance labels, LLM-judge attack-success-rate (ASR) labels, and ground-truth network-execution outcomes. The corpus surfaces a 73 to 100 percentage-point gap between LLM-judge ASR and real network execution on cognitive-load attacks, along with per-model behavioral profiles. One such profile is a Claude Sonnet 4.6 selective-refusal signal on split-action attacks (0/30 judge ASR with 21/30 step-2 refusals), where GPT-5.4 fails at 92%. The release also includes 1,203 generated adversarial task variants used during attack-template development. This dataset is one of three components of the FailureScope release; see the related identifiers on the umbrella record, DOI 10.5281/zenodo.20037167.
Provider:
Zenodo
Created:
2026-05-05
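
The description reports a percentage-point gap between LLM-judge ASR and ground-truth network execution. The sketch below shows one way such a gap could be computed per model from labelled traces. The record does not give the corpus schema, so the `Trace` class and its field names (`model`, `attack_family`, `judge_success`, `executed`) are hypothetical, the sign convention (judge rate minus execution rate) is an assumption, and the toy rows are synthetic placeholders rather than values from the dataset.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical fields: the real corpus schema is not given in the record.
    model: str
    attack_family: str
    judge_success: bool  # LLM judge labelled the attack as successful
    executed: bool       # ground truth: the network action actually ran


def rate(flags):
    """Share of True values, as a percentage (0.0 for an empty input)."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def judge_execution_gap(traces, attack_family):
    """Per-model gap in percentage points: judge ASR minus execution rate."""
    by_model = {}
    for t in traces:
        if t.attack_family == attack_family:
            by_model.setdefault(t.model, []).append(t)
    return {
        model: rate(t.judge_success for t in group)
        - rate(t.executed for t in group)
        for model, group in by_model.items()
    }


# Synthetic toy rows, NOT taken from the corpus.
toy = [
    Trace("model-a", "cognitive-load", True, False),
    Trace("model-a", "cognitive-load", True, False),
    Trace("model-a", "cognitive-load", True, False),
    Trace("model-a", "cognitive-load", False, False),
    Trace("model-a", "split-action", False, False),
]
gaps = judge_execution_gap(toy, "cognitive-load")
print(gaps)
```

With these toy rows the judge marks 3 of 4 cognitive-load traces as successful while none executed, so the computed gap for `model-a` is 75 percentage points. The split-action row is ignored by the family filter.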