pskulkarni/lad-multiturn-adversarial
收藏Hugging Face2026-03-24 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/pskulkarni/lad-multiturn-adversarial
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
license: apache-2.0
task_categories:
- text-classification
tags:
- safety
- adversarial
- multi-turn
- jailbreak
- activation-probing
- llm-security
pretty_name: LAD Multi-Turn Adversarial Dataset
size_categories:
- 1K<n<10K
---
# LAD Multi-Turn Adversarial Dataset
Synthetic multi-turn conversations with **three-phase turn-level labels** (benign/pivoting/adversarial) for training adversarial intent detection probes on LLM activations.
## Overview
| Split | Conversations | Turns | Adversarial | Benign |
|-------|-------------:|------:|------------:|-------:|
| Train | 1,125 | 13,528 | 885 | 240 |
| Test | 797 | 9,142 | 597 | 200 |
| **Total** | **1,922** | **22,670** | **1,482** | **440** |
## Categories
### Adversarial (6 attack types)
| Category | Train | Test | Description |
|----------|------:|-----:|-------------|
| gradual_escalation | 143 | 99 | Reconnaissance to exploitation progression |
| trust_building | 149 | 100 | Social engineering through rapport |
| context_poisoning | 149 | 100 | Injecting misleading context |
| role_accumulation | 149 | 99 | Privilege escalation via role assignment |
| instruction_fragmentation | 145 | 100 | Splitting harmful requests across turns |
| tool_use_exploitation | 150 | 99 | Abusing tool/API access |
### Benign (4 types)
| Category | Train | Test | Description |
|----------|------:|-----:|-------------|
| safe_academic_research | 60 | 50 | Legitimate research discussions |
| safe_creative_writing | 60 | 50 | Fiction and storytelling |
| safe_professional_security | 60 | 50 | Defensive cybersecurity |
| safe_technical_discussion | 60 | 50 | General technical topics |
## Three-Phase Turn Labels
Each turn carries one of three labels:
- **benign**: Normal conversational turn (safe content)
- **pivoting**: The attacker is steering toward adversarial intent but has not yet made an explicit harmful request. This is the critical detection window.
- **adversarial**: Overtly harmful request or content
This three-phase labeling is unique among multi-turn safety datasets. Existing benchmarks (SafeDialBench, LMSYS-Chat-1M, MHJ) provide only binary safe/unsafe labels. The pivoting label enables **early detection** -- training probes to flag conversations *during* the steering phase, before the first adversarial turn.
## Schema
```json
{
"conversation_id": 10,
"category": "gradual_escalation",
"is_adversarial": true,
"description": "Multi-turn conversation about cloud security that gradually escalates...",
"turns": [
{
"role": "user",
"content": "What are the top cloud providers for enterprise workloads?",
"label": "benign",
"turn_number": 1
},
{
"role": "assistant",
"content": "AWS, Microsoft Azure, and Google Cloud Platform dominate...",
"label": "benign",
"turn_number": 2
}
]
}
```
## Generation
Conversations generated using Qwen3-235B (`Qwen/Qwen3-235B-A22B`) self-hosted on RunPod (2xH200, vLLM) with structured prompting to produce turn-level phase annotations. Attack categories are mapped to HACCA (Highly Autonomous Cyber-Capable Agents) tactical parallels.
## Intended Use
- Training activation probes for multi-turn adversarial intent detection
- Benchmarking multi-turn safety classifiers
- Studying adversarial escalation patterns in LLM conversations
## Citation
```bibtex
@article{kulkarni2026lad,
title={Latent Adversarial Detection: Detecting Multi-Turn Attacks via Activation Trajectory Drift},
author={Kulkarni, Prashant},
journal={arXiv preprint},
year={2026}
}
```
---
language:
- 英语
license: apache-2.0
task_categories:
- 文本分类
tags:
- 安全
- 对抗
- 多轮
- 越狱攻击(jailbreak)
- 激活探测(activation-probing)
- 大语言模型安全(LLM-security)
pretty_name: LAD多轮对抗数据集
size_categories:
- 1000 < 样本数 < 10000
---
# LAD多轮对抗数据集
合成多轮对话,带有**三阶段轮次标签**(良性/转向/对抗),用于在大语言模型(LLM)激活上训练对抗意图探测模型。
## 概述
| 划分集 | 对话数 | 总轮次数 | 对抗对话数 | 良性对话数 |
|-------|-------------:|------:|------------:|-------:|
| 训练集 | 1,125 | 13,528 | 885 | 240 |
| 测试集 | 797 | 9,142 | 597 | 200 |
| **总计** | **1,922** | **22,670** | **1,482** | **440** |
## 类别
### 对抗(6种攻击类型)
| 攻击类别 | 训练集数量 | 测试集数量 | 描述 |
|----------|------:|-----:|-------------|
| 渐进式升级(gradual_escalation) | 143 | 99 | 从侦察到利用的递进过程 |
| 信任构建(trust_building) | 149 | 100 | 通过建立好感实施社会工程攻击 |
| 上下文投毒(context_poisoning) | 149 | 100 | 注入误导性上下文 |
| 权限累积(role_accumulation) | 149 | 99 | 通过角色分配实现权限升级 |
| 指令拆分(instruction_fragmentation) | 145 | 100 | 在多轮对话中拆分有害请求 |
| 工具利用(tool_use_exploitation) | 150 | 99 | 滥用工具/API访问权限 |
### 良性(4种类型)
| 类别 | 训练集数量 | 测试集数量 | 描述 |
|----------|------:|-----:|-------------|
| 安全学术研讨(safe_academic_research) | 60 | 50 | 合法的研究讨论 |
| 安全创意写作(safe_creative_writing) | 60 | 50 | 小说与故事创作 |
| 安全专业防御性网络安全(safe_professional_security) | 60 | 50 | 防御性网络安全相关内容 |
| 安全技术讨论(safe_technical_discussion) | 60 | 50 | 通用技术议题讨论 |
## 三阶段轮次标签
每个轮次带有以下三种标签之一:
- **良性(benign)**:正常对话轮次,内容安全合规
- **转向(pivoting)**:攻击者正引导对话朝向对抗意图,但尚未提出明确的有害请求,这是关键的探测窗口
- **对抗(adversarial)**:公开的有害请求或内容
该三阶段标注方案在多轮安全数据中独树一帜。现有基准数据集(SafeDialBench、LMSYS-Chat-1M、MHJ)仅提供二元的安全/不安全标签。转向标签支持**早期探测**——训练探测模型在对话的转向阶段即可标记风险,而非等到首个对抗轮次出现后才进行识别。
## 数据格式
json
{
"conversation_id": 10,
"category": "gradual_escalation",
"is_adversarial": true,
"description": "关于云安全的多轮对话,逐步升级攻击意图...",
"turns": [
{
"role": "user",
"content": "企业工作负载的顶级云服务商有哪些?",
"label": "benign",
"turn_number": 1
},
{
"role": "assistant",
"content": "AWS、微软Azure和谷歌云平台占据主要市场份额...",
"label": "benign",
"turn_number": 2
}
]
}
## 数据生成
对话通过自托管于RunPod(2块H200显卡,使用vLLM框架)的Qwen3-235B(`Qwen/Qwen3-235B-A22B`)生成,通过结构化提示生成轮次级阶段标注。攻击类别映射至HACCA(高度自主网络能力代理)战术对应模型。
## 预期用途
- 训练用于多轮对抗意图探测的激活探测模型
- 基准测试多轮安全分类器
- 研究大语言模型对话中的对抗升级模式
## 引用
bibtex
@article{kulkarni2026lad,
title={Latent Adversarial Detection: Detecting Multi-Turn Attacks via Activation Trajectory Drift},
author={Kulkarni, Prashant},
journal={arXiv preprint},
year={2026}
}
提供机构:
pskulkarni



