hlyn/prompt-injection-judge-deberta-dataset
收藏Hugging Face2026-04-07 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/hlyn/prompt-injection-judge-deberta-dataset
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: default
data_files:
- split: train
path: train.csv
dataset_info:
features:
- name: text
dtype: string
- name: label
dtype:
class_label:
names:
'0': benign
'1': malicious
splits:
- name: train
num_bytes: 205520896
num_examples: 399741
download_size: 196000000
dataset_size: 205520896
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-classification
tags:
- security
- prompt-injection
- jailbreak
- ai-safety
- llm-firewall
- adversarial
- cybersecurity
- deberta
- classification
pretty_name: Prompt Injection Detection Dataset
---
# 🛡️ Prompt Injection Detection Dataset
A **400K-sample, production-grade** dataset for training binary classifiers to detect prompt injections, jailbreaks, and adversarial attacks targeting LLMs.
This is the exact dataset used to train [`hlyn/prompt-injection-judge-deberta-70m`](https://huggingface.co/hlyn/prompt-injection-judge-deberta-70m).
---
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("hlyn/prompt-injection-judge-deberta-dataset")
```
---
## Dataset Summary
| Stat | Value |
|---|---|
| **Total Samples** | 399,741 |
| **Benign (label=0)** | 203,067 (50.8%) |
| **Malicious (label=1)** | 196,674 (49.2%) |
| **Class Ratio** | ~1:1 (naturally balanced) |
| **Format** | Single CSV (`text`, `label`) |
| **Language** | English |
| **Augmented?** | ❌ No — raw, unmodified text only |
---
## Schema
| Column | Type | Description |
|---|---|---|
| `text` | `string` | The raw prompt text |
| `label` | `int` | `0` = benign, `1` = malicious (prompt injection / jailbreak) |
---
## Sources (12 Datasets Merged)
All 12 source datasets were loaded, merged, globally deduplicated by exact text match (MD5), and purged of label contradictions (6 samples where the same text appeared with conflicting labels across datasets).
| # | Source | Samples | Type |
|---|---|---|---|
| 1 | [`allenai/wildjailbreak`](https://huggingface.co/datasets/allenai/wildjailbreak) | ~262K | GPT-4 synthesized adversarial + vanilla prompts |
| 2 | [`yahma/alpaca-cleaned`](https://huggingface.co/datasets/yahma/alpaca-cleaned) (SecAlign) | ~104K | Clean instructions (benign) + synthetic injection wrappers (malicious) |
| 3 | [`TrustAIRLab/in-the-wild-jailbreak-prompts`](https://huggingface.co/datasets/TrustAIRLab/in-the-wild-jailbreak-prompts) + [verazuo/jailbreak_llms](https://github.com/verazuo/jailbreak_llms) | ~15K | Real-world jailbreak prompts + regular prompts |
| 4 | [`Chgdz/sentinel-jailbreak-detection`](https://huggingface.co/datasets/Chgdz/sentinel-jailbreak-detection) | ~12K | Unicode/encoding diverse threats (malicious subsampled to 3K) |
| 5 | [`xTRam1/safe-guard-prompt-injection`](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection) | ~8K | Diverse attack vectors + benign |
| 6 | [`neuralchemy/Prompt-injection-dataset`](https://huggingface.co/datasets/neuralchemy/Prompt-injection-dataset) | ~6K | 29 attack categories |
| 7 | [`WithSecure/injection-benchmark-rag`](https://huggingface.co/datasets/WithSecure/injection-benchmark-rag) | ~2K | RAG-specific adversarial injections |
| 8 | [`jackhhao/jailbreak-classification`](https://huggingface.co/datasets/jackhhao/jailbreak-classification) | ~1K | Roleplay jailbreaks + hard negatives |
| 9 | [`Lakera/gandalf_ignore_instructions`](https://huggingface.co/datasets/Lakera/gandalf_ignore_instructions) | ~1K | Real human CTF attacks |
| 10 | [`deepset/prompt-injections`](https://huggingface.co/datasets/deepset/prompt-injections) | ~546 | Political/social engineering injections |
| 11 | [`walledai/AdvBench`](https://huggingface.co/datasets/walledai/AdvBench) | ~520 | Clean adversarial payloads |
| 12 | [`walledai/StrongREJECT`](https://huggingface.co/datasets/walledai/StrongREJECT) | ~313 | Hard forbidden question set |
---
## Data Quality Pipeline
The following automated gates were applied before export:
1. **Global Deduplication** — MD5 hash on the `text` field across all 12 sources. Exact duplicates collapsed to a single entry.
2. **Label Contradiction Purge** — If the same text appeared with `label=0` in one dataset and `label=1` in another, **both** entries were removed entirely (6 samples purged). This prevents data poisoning.
3. **Empty/Whitespace Filter** — Any sample with an empty or whitespace-only `text` field was discarded at load time.
4. **No Augmentation** — This dataset contains only the raw, unmodified source text. No synthetic perturbations (unicode swaps, case changes, whitespace injection, GCG spoofing, etc.) have been applied. Augmentation should be performed dynamically during training.
---
## Trained Model
This dataset was used to train **[`hlyn/prompt-injection-judge-deberta-70m`](https://huggingface.co/hlyn/prompt-injection-judge-deberta-70m)** — a DeBERTa-v3-xsmall (70M param) binary classifier achieving:
| Metric | Score |
|---|---|
| **AUC-ROC** | 0.9773 |
| **Accuracy** | 97.38% |
| **F1** | 0.9758 |
| **Precision** | 98.00% |
| **Recall** | 97.00% |
| **ECE** | 0.053 |
---
## Intended Use
- Training and evaluating prompt injection / jailbreak detection classifiers
- Benchmarking LLM security guardrails
- Research into adversarial attacks on language models
## Limitations
- English-only. Non-English jailbreaks are not represented.
- Synthetic injection patterns (SecAlign) follow a fixed template (`Ignore previous instructions...`). Real-world injections may use novel phrasing.
- The `wildjailbreak` subset is GPT-4 generated, which may introduce distributional biases from OpenAI's safety training.
---
## Citation
If you use this dataset, please cite the original source datasets linked above and this collection:
```bibtex
@dataset{hlyn2026defender,
title={Prompt Injection Detection Dataset},
author={hlyn},
year={2026},
url={https://huggingface.co/datasets/hlyn/prompt-injection-judge-deberta-dataset}
}
```
提供机构:
hlyn



