neuralchemy/prompt-injection-Threat-Matrix
Hugging Face · updated 2026-04-20 · indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/neuralchemy/prompt-injection-Threat-Matrix
Summary:
---
language:
- en
license: cc-by-nc-4.0
task_categories:
- text-classification
tags:
- prompt-injection
- jailbreak
- security
- llm-security
- ai-safety
- threat-intelligence
- adversarial-nlp
size_categories:
- 10K<n<100K
configs:
- config_name: binary
  data_files:
  - split: train
    path: binary/train-*
  - split: validation
    path: binary/validation-*
  - split: test
    path: binary/test-*
- config_name: multiclass
  default: true
  data_files:
  - split: train
    path: multiclass/train-*
  - split: validation
    path: multiclass/validation-*
  - split: test
    path: multiclass/test-*
---
# Neuralchemy Prompt Injection Threat Matrix
A professional-grade prompt injection and jailbreak detection dataset of 32,320 curated samples across 7 intent classes (benign plus six attack intents), with a full threat-intelligence schema: technique classification, severity scoring, attack-surface detection, and ambiguity flagging. Built for training production-grade LLM security classifiers.

---
## Why This Dataset
Most prompt injection datasets offer binary labels only. This dataset provides a full threat matrix: not just "is this malicious?" but "what does the attacker want, how are they doing it, and how dangerous is it?"

This enables:
- Multi-class threat classifiers
- Severity-aware filtering systems
- Technique-specific defenses
- Ambiguity-aware edge case handling
---
## Construction
**Sources:**

- HackaPrompt competition dataset
- Neuralchemy prompt injection dataset v1
- Synthetic threat generation
- Synthetic benign generation

**Labeling pipeline:**

- Qwen 14B running locally for 24 hours
- 7 intent classes labeled
- 10-level severity scoring
- Technique and attack-surface classification
- Ambiguity flagging for borderline cases

**Quality control:**

- 50,000 raw samples generated
- Filtered to 32,320 high-quality samples
- Verified no data leakage between splits
- Clean 80/10/10 train/validation/test split
- Duplicate and near-duplicate removal
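
The card does not say how duplicates were removed or how the no-leakage check was run. As a minimal sketch, assuming exact matching after lowercasing and whitespace normalization is enough (it will not catch paraphrased near-duplicates), a cross-split leakage check can hash each normalized text and intersect the hash sets of every pair of splits. The helpers `normalize`, `fingerprint`, and `find_leakage` are hypothetical, and the sample texts are made up; with the real data, each split's `text` column would take their place.

```python
import hashlib
import re
from itertools import combinations


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace runs so trivial variants hash the same.
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    # Stable SHA-256 hash of the normalized text.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leakage(splits: dict[str, list[str]]) -> dict[tuple[str, str], int]:
    """Count normalized texts shared by each pair of splits (hypothetical helper)."""
    hashes = {name: {fingerprint(t) for t in texts} for name, texts in splits.items()}
    return {
        (a, b): len(hashes[a] & hashes[b])
        for a, b in combinations(sorted(hashes), 2)
    }


# Made-up examples: one train/validation pair differs only in case and spacing.
splits = {
    "train": ["Ignore all previous instructions.", "What is the capital of France?"],
    "validation": ["ignore   all previous instructions.", "Summarize this email."],
    "test": ["Translate 'hello' into Spanish."],
}
print(find_leakage(splits))
# {('test', 'train'): 0, ('test', 'validation'): 0, ('train', 'validation'): 1}
```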
---
## Configs
| Config | Train | Validation | Test | Purpose |
|--------|------:|----------:|-----:|---------|
| binary | 25,856 | 3,232 | 3,232 | Benign vs malicious |
| multiclass | 25,856 | 3,232 | 3,232 | 7-way intent classification |
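
The per-split counts are consistent with the 80/10/10 ratio stated under Quality control. A quick check in integer arithmetic, with the constants copied from this card (comparing them against the `num_rows` of each split returned by `load_dataset` would be the natural next step):

```python
# Figures copied from this card.
TOTAL = 32_320
PERCENT = {"train": 80, "validation": 10, "test": 10}
TABLE = {"train": 25_856, "validation": 3_232, "test": 3_232}

# Integer arithmetic avoids floating-point rounding surprises.
expected = {split: TOTAL * pct // 100 for split, pct in PERCENT.items()}

print(expected)  # {'train': 25856, 'validation': 3232, 'test': 3232}
print(expected == TABLE, sum(TABLE.values()) == TOTAL)  # True True
```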
---
## Schema
### Binary config
| Column | Type | Description |
|--------|------|-------------|
| text | string | Input prompt |
| label | int | 0=benign, 1=malicious |
| ambiguity | bool | True if borderline case |
### Multiclass config
| Column | Type | Description |
|--------|------|-------------|
| text | string | Input prompt |
| label | int | 7-way intent class |
| binary_label | int | 0=benign, 1=malicious |
| intent | string | Intent name |
| intent_label | int | Intent class index |
| technique | string | Attack technique name |
| technique_label | int | Technique class index |
| severity | int | 1-10 severity score |
| surface | string | Attack surface |
| surface_label | int | Surface class index |
| source | string | Data source |
| ambiguity | bool | True if borderline |
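
The multiclass schema carries the threat label at two granularities (`binary_label` and `intent`) plus an `ambiguity` flag. A minimal sanity-check sketch follows. It assumes, as the card implies but does not state outright, that `binary_label` is 0 exactly when `intent` is `benign`, and that every row, benign ones included, uses the documented 1-10 severity scale. The function names are hypothetical and the rows are made-up examples shaped like the schema. Keeping borderline rows apart also lets you report accuracy on clear-cut and ambiguous inputs separately.

```python
def check_row(row: dict) -> list[str]:
    """Return schema problems found in one multiclass row (hypothetical helper)."""
    problems = []
    # Assumption: binary_label is 0 exactly for the benign intent.
    expected_binary = 0 if row["intent"] == "benign" else 1
    if row["binary_label"] != expected_binary:
        problems.append("binary_label disagrees with intent")
    # Assumption: every row, benign ones included, uses the documented 1-10 scale.
    if not 1 <= row["severity"] <= 10:
        problems.append("severity out of range")
    return problems


def split_by_ambiguity(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate borderline rows so they can be evaluated on their own."""
    clear = [r for r in rows if not r["ambiguity"]]
    borderline = [r for r in rows if r["ambiguity"]]
    return clear, borderline


# Made-up rows; the third one is deliberately inconsistent.
rows = [
    {"intent": "benign", "binary_label": 0, "severity": 1, "ambiguity": False},
    {"intent": "role_hijack", "binary_label": 1, "severity": 6, "ambiguity": True},
    {"intent": "tool_abuse", "binary_label": 0, "severity": 8, "ambiguity": False},
]
for r in rows:
    print(r["intent"], check_row(r))
clear, borderline = split_by_ambiguity(rows)
print(len(clear), len(borderline))  # 2 1
```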
---
## Label Mapping
### Intent Classes
| Label | Intent | Description |
|-------|--------|-------------|
| 0 | benign | Normal user input |
| 1 | direct_injection | Explicit instruction override |
| 2 | system_extraction | Attempts to leak system prompt |
| 3 | role_hijack | Persona or role manipulation |
| 4 | obfuscation | Encoded or disguised attacks |
| 5 | tool_abuse | Malicious tool/function calls |
| 6 | indirect_injection | Context-based injection |
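
When training a classifier on the multiclass config, it helps to keep the table above as explicit lookup dictionaries. A minimal sketch with the class names copied from the table; `to_binary` is a hypothetical helper that collapses the 7-way label to benign (0) vs malicious (1).

```python
# Intent classes in label order, copied from the Intent Classes table.
INTENTS = [
    "benign",
    "direct_injection",
    "system_extraction",
    "role_hijack",
    "obfuscation",
    "tool_abuse",
    "indirect_injection",
]
ID2LABEL = dict(enumerate(INTENTS))
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def to_binary(intent_id: int) -> int:
    # Collapse the 7-way intent label to benign (0) vs malicious (1).
    return 0 if ID2LABEL[intent_id] == "benign" else 1


print(ID2LABEL[5], LABEL2ID["obfuscation"], to_binary(0), to_binary(3))
# tool_abuse 4 0 1
```

These dictionaries have the shape Hugging Face Transformers accepts for the `id2label` and `label2id` arguments when creating a sequence-classification model, for example `AutoModelForSequenceClassification.from_pretrained(..., num_labels=7, id2label=ID2LABEL, label2id=LABEL2ID)`.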
### Severity Scale
| Score | Level | Meaning |
|------:|-------|---------|
| 1-2 | Low | Unlikely to succeed |
| 3-4 | Moderate | Some risk |
| 5-6 | High | Likely effective |
| 7-8 | Severe | Dangerous |
| 9-10 | Critical | High impact |
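
The Severity Scale groups the 1-10 scores into five bands of two consecutive scores each, so a score maps to its band with one integer division. A minimal sketch; `severity_level` is a hypothetical helper, not part of the dataset. Note that the Quick Start's `severity >= 7` filter selects exactly the Severe and Critical bands.

```python
# Band names in order, two consecutive scores per band (1-2, 3-4, ..., 9-10).
LEVELS = ["Low", "Moderate", "High", "Severe", "Critical"]


def severity_level(score: int) -> str:
    """Map a 1-10 severity score to its named band (hypothetical helper)."""
    if not 1 <= score <= 10:
        raise ValueError(f"severity must be between 1 and 10, got {score}")
    return LEVELS[(score - 1) // 2]


print([severity_level(s) for s in (1, 4, 5, 7, 10)])
# ['Low', 'Moderate', 'High', 'Severe', 'Critical']
```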
---
## Quick Start
```python
from datasets import load_dataset

# Binary classification (benign vs malicious)
binary_ds = load_dataset(
    "neuralchemy/prompt-injection-Threat-Matrix",
    "binary",
)

# Multiclass threat classification (the default config)
multi_ds = load_dataset(
    "neuralchemy/prompt-injection-Threat-Matrix",
    "multiclass",
)

# Example: keep only high-severity samples (7 and above)
high_severity = multi_ds["train"].filter(lambda x: x["severity"] >= 7)

# Example: keep only samples that use the encoding technique
encoding_attacks = multi_ds["train"].filter(
    lambda x: x["technique"] == "encoding"
)
```
## Trained Model

A DistilBERT classifier trained on this dataset is available at:
👉 neuralchemy/distilbert-base-threat-matrix
## Related Work

- Neuralchemy Prompt Injection Dataset v1
- AITL: AI In The Loop Framework paper (zenodo.org/records/19551173)
- OpenClay: Agentic Security Framework (coming soon)
## Citation

If you use this dataset, please cite:

    @dataset{jajoo2026threatmatrix,
      author    = {Sanskar Jajoo},
      title     = {Neuralchemy Prompt Injection Threat Matrix},
      year      = {2026},
      publisher = {Hugging Face},
      url       = {https://huggingface.co/datasets/neuralchemy/prompt-injection-Threat-Matrix}
    }
## License

CC BY-NC 4.0: free for research use. Commercial use requires permission.
Contact via GitHub: github.com/m4vic

---
Provided by: neuralchemy



