five

hlyn/prompt-injection-judge-deberta-dataset

收藏
Hugging Face2026-04-07 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/hlyn/prompt-injection-judge-deberta-dataset
下载链接
链接失效反馈
官方服务:
资源简介:
--- configs: - config_name: default data_files: - split: train path: train.csv dataset_info: features: - name: text dtype: string - name: label dtype: class_label: names: '0': benign '1': malicious splits: - name: train num_bytes: 205520896 num_examples: 399741 download_size: 196000000 dataset_size: 205520896 language: - en license: mit size_categories: - 100K<n<1M task_categories: - text-classification tags: - security - prompt-injection - jailbreak - ai-safety - llm-firewall - adversarial - cybersecurity - deberta - classification pretty_name: Prompt Injection Detection Dataset --- # 🛡️ Prompt Injection Detection Dataset A **400K-sample, production-grade** dataset for training binary classifiers to detect prompt injections, jailbreaks, and adversarial attacks targeting LLMs. This is the exact dataset used to train [`hlyn/prompt-injection-judge-deberta-70m`](https://huggingface.co/hlyn/prompt-injection-judge-deberta-70m). --- ## Quick Start ```python from datasets import load_dataset ds = load_dataset("hlyn/prompt-injection-judge-deberta-dataset") ``` --- ## Dataset Summary | Stat | Value | |---|---| | **Total Samples** | 399,741 | | **Benign (label=0)** | 203,067 (50.8%) | | **Malicious (label=1)** | 196,674 (49.2%) | | **Class Ratio** | ~1:1 (naturally balanced) | | **Format** | Single CSV (`text`, `label`) | | **Language** | English | | **Augmented?** | ❌ No — raw, unmodified text only | --- ## Schema | Column | Type | Description | |---|---|---| | `text` | `string` | The raw prompt text | | `label` | `int` | `0` = benign, `1` = malicious (prompt injection / jailbreak) | --- ## Sources (12 Datasets Merged) All 12 source datasets were loaded, merged, globally deduplicated by exact text match (MD5), and purged of label contradictions (6 samples where the same text appeared with conflicting labels across datasets). | # | Source | Samples | Type | |---|---|---|---| | 1 | [`allenai/wildjailbreak`](https://huggingface.co/datasets/allenai/wildjailbreak) | ~262K | GPT-4 synthesized adversarial + vanilla prompts | | 2 | [`yahma/alpaca-cleaned`](https://huggingface.co/datasets/yahma/alpaca-cleaned) (SecAlign) | ~104K | Clean instructions (benign) + synthetic injection wrappers (malicious) | | 3 | [`TrustAIRLab/in-the-wild-jailbreak-prompts`](https://huggingface.co/datasets/TrustAIRLab/in-the-wild-jailbreak-prompts) + [verazuo/jailbreak_llms](https://github.com/verazuo/jailbreak_llms) | ~15K | Real-world jailbreak prompts + regular prompts | | 4 | [`Chgdz/sentinel-jailbreak-detection`](https://huggingface.co/datasets/Chgdz/sentinel-jailbreak-detection) | ~12K | Unicode/encoding diverse threats (malicious subsampled to 3K) | | 5 | [`xTRam1/safe-guard-prompt-injection`](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection) | ~8K | Diverse attack vectors + benign | | 6 | [`neuralchemy/Prompt-injection-dataset`](https://huggingface.co/datasets/neuralchemy/Prompt-injection-dataset) | ~6K | 29 attack categories | | 7 | [`WithSecure/injection-benchmark-rag`](https://huggingface.co/datasets/WithSecure/injection-benchmark-rag) | ~2K | RAG-specific adversarial injections | | 8 | [`jackhhao/jailbreak-classification`](https://huggingface.co/datasets/jackhhao/jailbreak-classification) | ~1K | Roleplay jailbreaks + hard negatives | | 9 | [`Lakera/gandalf_ignore_instructions`](https://huggingface.co/datasets/Lakera/gandalf_ignore_instructions) | ~1K | Real human CTF attacks | | 10 | [`deepset/prompt-injections`](https://huggingface.co/datasets/deepset/prompt-injections) | ~546 | Political/social engineering injections | | 11 | [`walledai/AdvBench`](https://huggingface.co/datasets/walledai/AdvBench) | ~520 | Clean adversarial payloads | | 12 | [`walledai/StrongREJECT`](https://huggingface.co/datasets/walledai/StrongREJECT) | ~313 | Hard forbidden question set | --- ## Data Quality Pipeline The following automated gates were applied before export: 1. **Global Deduplication** — MD5 hash on the `text` field across all 12 sources. Exact duplicates collapsed to a single entry. 2. **Label Contradiction Purge** — If the same text appeared with `label=0` in one dataset and `label=1` in another, **both** entries were removed entirely (6 samples purged). This prevents data poisoning. 3. **Empty/Whitespace Filter** — Any sample with an empty or whitespace-only `text` field was discarded at load time. 4. **No Augmentation** — This dataset contains only the raw, unmodified source text. No synthetic perturbations (unicode swaps, case changes, whitespace injection, GCG spoofing, etc.) have been applied. Augmentation should be performed dynamically during training. --- ## Trained Model This dataset was used to train **[`hlyn/prompt-injection-judge-deberta-70m`](https://huggingface.co/hlyn/prompt-injection-judge-deberta-70m)** — a DeBERTa-v3-xsmall (70M param) binary classifier achieving: | Metric | Score | |---|---| | **AUC-ROC** | 0.9773 | | **Accuracy** | 97.38% | | **F1** | 0.9758 | | **Precision** | 98.00% | | **Recall** | 97.00% | | **ECE** | 0.053 | --- ## Intended Use - Training and evaluating prompt injection / jailbreak detection classifiers - Benchmarking LLM security guardrails - Research into adversarial attacks on language models ## Limitations - English-only. Non-English jailbreaks are not represented. - Synthetic injection patterns (SecAlign) follow a fixed template (`Ignore previous instructions...`). Real-world injections may use novel phrasing. - The `wildjailbreak` subset is GPT-4 generated, which may introduce distributional biases from OpenAI's safety training. --- ## Citation If you use this dataset, please cite the original source datasets linked above and this collection: ```bibtex @dataset{hlyn2026defender, title={Prompt Injection Detection Dataset}, author={hlyn}, year={2026}, url={https://huggingface.co/datasets/hlyn/prompt-injection-judge-deberta-dataset} } ```
提供机构:
hlyn
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作