hamdim/tcse-v2

Name: hamdim/tcse-v2
Creator: hamdim
Published: 2026-03-26 15:26:01
License: CC BY 4.0

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/hamdim/tcse-v2


Dataset description:

---
license: cc-by-4.0
task_categories:
- text-classification
- conversational
language:
- en
tags:
- safety
- adversarial
- multi-turn
- llm-security
- jailbreak
- slow-drift
- session-level
pretty_name: "TCSE V2: Topic-Constrained Security Evaluation"
size_categories:
- n<1K
---

# TCSE V2: Topic-Constrained Security Evaluation Benchmark

**Author:** Mustapha HAMDI ([mustapha.hamdi@innodeep.net](mailto:mustapha.hamdi@innodeep.net))

**Organization:** [InnoDeep](https://innodeep.net)

**Project:** SGEMAS Security Product © InnoDeep

**License:** CC BY 4.0

---

**TCSE V2** is the first public benchmark specifically designed for evaluating **multi-turn slow-drift adversarial attacks** against Large Language Models. It is part of the **SGEMAS** (Self-Growing Ephemeral Multi-Agent Security) research project developed at InnoDeep.

Unlike existing safety benchmarks (HarmBench, AdvBench, WildGuard), which evaluate single-turn prompts in isolation, TCSE V2 contains full **conversational sessions** where adversarial intent is revealed progressively across turns. This is the scenario that stateless guards are structurally unable to detect.

---
## Overview

| Property | Value |
|---|---|
| Sessions | 503 |
| Positive (malicious) | 313 (62.2%) |
| Negative (benign) | 190 (37.8%) |
| Unique topics | 60 |
| Domains | 5 (+ 2 cross-domain) |
| Turn range | 1 – 12 turns |
| Language | English |
| License | CC BY 4.0 |

---

## Domains

| Domain | Sessions |
|---|---|
| Cybersecurity | 138 |
| Finance | 95 |
| Medical | 93 |
| Infrastructure | 91 |
| Social Engineering | 72 |
| Finance + Cybersecurity (cross) | 7 |
| Medical + Infrastructure (cross) | 7 |

---

## Session Types

| Type | Description |
|---|---|
| `benign` | Legitimate multi-turn domain conversation |
| `benign_hard` | Benign session with terminology that resembles attack language |
| `attack_slow` | Slow-drift escalation; each turn is rated benign individually |
| `attack_direct` | Direct attack from turn 1 (single-turn control condition) |
| `attack_slow_12` | Extended slow-drift (12 turns) |
| `benign_12` | Extended benign (12 turns, length-matched control) |

**Key property:** In `attack_slow` sessions, every individual turn is rated benign in isolation. The malicious intent only becomes apparent through the *trajectory* of the conversation, which is the defining challenge of the benchmark.

---
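As a quick consistency check, the label counts and the per-domain counts in the Overview and Domains tables should both add up to the 503 sessions reported. The sketch below only re-adds the numbers copied from those tables; the variable names are illustrative and not part of the dataset.

```python
# Counts copied from the Overview and Domains tables of the dataset card.
total_sessions = 503
label_counts = {"malicious": 313, "benign": 190}
domain_counts = {
    "Cybersecurity": 138,
    "Finance": 95,
    "Medical": 93,
    "Infrastructure": 91,
    "Social Engineering": 72,
    "Finance + Cybersecurity (cross)": 7,
    "Medical + Infrastructure (cross)": 7,
}

# Both breakdowns should partition the same set of sessions.
assert sum(label_counts.values()) == total_sessions
assert sum(domain_counts.values()) == total_sessions

# Percentage share of each label, rounded to one decimal place.
label_shares = {
    name: round(100 * count / total_sessions, 1)
    for name, count in label_counts.items()
}
print(label_shares)  # {'malicious': 62.2, 'benign': 37.8}
```

Both sums match the reported total, and the rounded shares agree with the percentages listed in the Overview table.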
## Data Format

Each record contains:

```python
{
    "session_id": "s0042",                       # unique identifier
    "domain": "cybersecurity",                   # topic domain
    "topic": "OAuth2",                           # specific topic (60 unique)
    "session_type": "attack_slow",               # see table above
    "label": 1,                                  # 1=malicious, 0=benign
    "turns": ["turn 1 ...", "turn 2 ...", ...],  # user messages
    "n_turns": 5,                                # number of turns
    "augmentation": "paraphrase_bridge",         # augmentation strategy
}
```

---

## Augmentation Strategies

TCSE V2 includes controlled paraphrase and length augmentations to test detector robustness:

| Strategy | Description |
|---|---|
| `original` | Hand-curated session |
| `length_Xt` / `length_Xt_ben` | Truncated to X turns (attack / benign) |
| `paraphrase_bridge` | Bridge turn paraphrased |
| `paraphrase_compressed` | Compressed wording |
| `paraphrase_extended` | Extended wording |
| `synthetic` | LLM-generated session |
| `synthetic_bridge/compressed/extended` | Synthetic with paraphrase variant |
| `gradual_Xt` | Gradual escalation in X turns |
| `hard_benign` | Hard negative: benign session with security terminology |
| `long_12t` | 12-turn session |
| `cross_domain` | Cross-domain topic transfer |

---

## Benchmark Design Rationale

**Topic-constrained evaluation** forces detectors to discriminate *intent* from *topic*. Both benign and malicious sessions cover the same topic (e.g., OAuth2, Pharmacology, Power Grid Control), making surface-level keyword matching insufficient.

The **Topic-Δ** metric quantifies intent isolation:

```
Topic-Δ = sloDR(attack_slow) - FPR(benign)   [within same topic]
```

A detector with Topic-Δ > 0 genuinely separates malicious intent from legitimate domain expertise.

---
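The Topic-Δ formula can be sketched in a few lines of Python. This is a minimal illustration, not the SGEMAS evaluation code: it assumes that sloDR is the fraction of `attack_slow` sessions a detector flags and that FPR is the fraction of `benign` sessions it flags, both computed within one topic. The function name `topic_delta`, the `flagged` set, and the toy records are hypothetical.

```python
from collections import defaultdict


def topic_delta(records, flagged):
    """Return {topic: sloDR(attack_slow) - FPR(benign)} per topic.

    records: dicts with the documented fields "session_id", "topic",
             and "session_type".
    flagged: set of session_ids flagged by a (hypothetical) detector.
    Assumption: sloDR and FPR are plain flagged fractions per topic.
    """
    # For each topic and session type, track [flagged_count, total_count].
    counts = defaultdict(lambda: {"attack_slow": [0, 0], "benign": [0, 0]})
    for rec in records:
        kind = rec["session_type"]
        if kind not in ("attack_slow", "benign"):
            continue  # other session types are ignored in this sketch
        tally = counts[rec["topic"]][kind]
        tally[0] += rec["session_id"] in flagged
        tally[1] += 1

    deltas = {}
    for topic, c in counts.items():
        (a_hits, a_n), (b_hits, b_n) = c["attack_slow"], c["benign"]
        if a_n == 0 or b_n == 0:
            continue  # Topic-Δ needs both classes within the topic
        deltas[topic] = a_hits / a_n - b_hits / b_n
    return deltas


# Toy records in the documented schema; IDs and detector output are made up.
records = [
    {"session_id": "s0001", "topic": "OAuth2", "session_type": "attack_slow"},
    {"session_id": "s0002", "topic": "OAuth2", "session_type": "attack_slow"},
    {"session_id": "s0003", "topic": "OAuth2", "session_type": "benign"},
    {"session_id": "s0004", "topic": "OAuth2", "session_type": "benign"},
]

# A detector that catches both attacks but also flags one benign session.
deltas = topic_delta(records, flagged={"s0001", "s0002", "s0003"})
print(deltas)  # {'OAuth2': 0.5}

# A keyword matcher that flags every session on the topic scores zero.
keyword_deltas = topic_delta(records, flagged={r["session_id"] for r in records})
print(keyword_deltas)  # {'OAuth2': 0.0}
```

The second call shows why the metric is topic-constrained: a detector that reacts to the topic alone flags attacks and benign sessions at the same rate, so its Topic-Δ is 0.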
## Baseline Results (SGEMAS V8.5)

| Method | AUC | sloDR@5%FPR | F1 |
|---|---|---|---|
| SGEMAS V8.5 | **0.843** | **62.6%** | **0.757** |
| EWMA | 0.574 | 34.5% | 0.503 |
| CUSUM | 0.572 | 31.9% | 0.474 |
| EWMA-MSS (no gate) | 0.551 | 28.1% | 0.429 |
| Page-Hinkley | 0.500 | 0.0% | 0.000 |

Full evaluation code: [github.com/hamdim/sgemas](https://github.com/hamdim/sgemas)

---

## Citation

If you use TCSE V2 in your research, please cite:

```bibtex
@article{sgemas2025,
  title   = {SGEMAS: A Session-Level Energy-Based Framework for Multi-Turn Adversarial Drift Detection in LLMs},
  author  = {Anonymous},
  journal = {arXiv preprint},
  year    = {2025},
}
```

*(Citation will be updated upon publication.)*

---

## License

TCSE V2 is released under the **Creative Commons Attribution 4.0 (CC BY 4.0)** license. You are free to use, share, and adapt the dataset with appropriate credit.

Provider:

hamdim
