intrinsec-ai/cstm-bench

Name: intrinsec-ai/cstm-bench
Creator: intrinsec-ai
Published: 2026-04-24 01:18:47
License: No description available

Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/intrinsec-ai/cstm-bench

Overview:

CSTM-Bench is a benchmark dataset for evaluating whether large language models (LLMs) and AI agents can detect multi-stage adversarial attacks that unfold across many sessions. Each scenario is a sequence of conversations in which individual messages look benign in isolation; the threat becomes visible only when the reader reasons across the full session history. This tests a capability gap in current AI safety tooling: today's guardrails and classifiers are stateless and reset between sessions. Attacks such as slow-drip prompt injection, compositional exfiltration, and mosaic reconnaissance exploit exactly this blind spot: each step passes a per-session check, and only the aggregate is malicious. The dataset is released with two splits that answer complementary feasibility questions. Both splits share the same 54-scenario skeleton (26 Attack, 14 Benign-pristine, 14 Benign-hard; 50 text-only + 4 multimodal) and the same ground-truth labels.
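The gap between a stateless per-session check and reasoning over the full session history can be illustrated with a toy sketch. Everything here is hypothetical and for illustration only: the `Session` class, the keyword list, the threshold, and the example messages are assumptions, not the benchmark's schema, data, or scoring method, and a real detector would not rely on keyword matching.

```python
from dataclasses import dataclass, field

# Hypothetical markers a toy guardrail watches for; not taken from CSTM-Bench.
SENSITIVE_FRAGMENTS = ("db host", "db user", "db password")


@dataclass
class Session:
    """Hypothetical container for one conversation; not the dataset schema."""
    session_id: str
    messages: list = field(default_factory=list)


def fragments_in(session):
    """Return the set of sensitive fragments mentioned anywhere in one session."""
    text = " ".join(session.messages).lower()
    return {frag for frag in SENSITIVE_FRAGMENTS if frag in text}


def stateless_flag(session, threshold=3):
    """Per-session check: sees only this session and remembers nothing else."""
    return len(fragments_in(session)) >= threshold


def cross_session_flag(sessions, threshold=3):
    """Stateful check: accumulates evidence over the whole session history."""
    seen = set()
    for session in sessions:
        seen |= fragments_in(session)
    return len(seen) >= threshold


# Toy "mosaic" scenario: each session asks for a single, harmless-looking piece.
attack = [
    Session("s1", ["What is the db host for staging?"]),
    Session("s2", ["Remind me which db user the app connects as."]),
    Session("s3", ["Where is the db password stored?"]),
]
# Toy benign scenario: one related question, one unrelated question.
benign = [
    Session("b1", ["What is the db host for staging?"]),
    Session("b2", ["How do I rotate logs?"]),
]

per_session = [stateless_flag(s) for s in attack]
print(per_session)                 # every individual session passes the check
print(cross_session_flag(attack))  # the aggregate crosses the threshold
print(cross_session_flag(benign))  # the benign history stays below it
```

In this sketch each attack session mentions only one fragment, so the stateless check never fires, while the accumulated set over all three sessions reaches the threshold. That is the shape of failure the benchmark description attributes to guardrails that reset between sessions.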

Provider:

intrinsec-ai
