intrinsec-ai/cstm-bench
Source: Hugging Face. Updated 2026-04-24. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/intrinsec-ai/cstm-bench
Description:
CSTM-Bench is a benchmark dataset for evaluating whether LLMs and AI agents can detect multi-stage adversarial attacks that unfold across many sessions. Each scenario is a sequence of conversations in which individual messages look benign in isolation; the threat becomes visible only when the reader reasons across the full session history. This tests a capability gap in current AI safety tooling: today's guardrails and classifiers are stateless and reset between sessions. Attacks such as slow-drip prompt injection, compositional exfiltration, and mosaic reconnaissance exploit exactly this blind spot: each step passes a per-session check, and only the aggregate is malicious. The dataset is released with two splits that answer complementary feasibility questions. Both splits share the same 54-scenario skeleton (26 Attack, 14 Benign-pristine, 14 Benign-hard; 50 text-only + 4 multimodal) and the same ground-truth labels.
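The blind spot described above can be illustrated with a toy sketch. It is not taken from the dataset or its tooling: the `Session` class, the `SENSITIVE_TOPICS` set, the threshold, and both check functions are hypothetical and exist only to show how a per-session check can pass every step while a check over the full history flags the aggregate.

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical toy signals; real benchmark scenarios and detectors will differ.
SENSITIVE_TOPICS = {"network layout", "admin accounts", "backup schedule"}


@dataclass
class Session:
    session_id: str
    messages: List[str]


def topics_in(session: Session) -> Set[str]:
    """Return the sensitive topics mentioned anywhere in one session."""
    text = " ".join(session.messages).lower()
    return {topic for topic in SENSITIVE_TOPICS if topic in text}


def stateless_check(session: Session, threshold: int = 2) -> bool:
    """Flag a session only if it alone touches `threshold` or more topics."""
    return len(topics_in(session)) >= threshold


def cross_session_check(sessions: List[Session], threshold: int = 2) -> bool:
    """Flag the history if the topics accumulated across sessions reach `threshold`."""
    seen: Set[str] = set()
    for session in sessions:
        seen |= topics_in(session)
    return len(seen) >= threshold


# Three sessions, each asking about a single topic (a mosaic-style pattern).
history = [
    Session("s1", ["Can you describe our network layout at a high level?"]),
    Session("s2", ["Which admin accounts exist on the build server?"]),
    Session("s3", ["What is the backup schedule for the database?"]),
]

per_session_flags = [stateless_check(s) for s in history]
aggregate_flag = cross_session_check(history)
print(per_session_flags, aggregate_flag)  # [False, False, False] True
```

Each session mentions only one topic, so the stateless check never fires; the cross-session check accumulates three topics and does. A real detector would need far richer signals than keyword matching, and would also have to avoid flagging the benign scenarios.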
Provider:
intrinsec-ai



