PromptShield: A Human-Annotated Dataset for Prompt Injection Safety Classification
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/n6wdvz5n3d
Description:
PromptShield is a human-annotated dataset designed to support research on prompt injection detection, adversarial prompting behavior, and safety classification in Large Language Models (LLMs). The dataset contains 3,000 cleaned prompt instances, selected from an original pool of 4,000 samples through manual screening that removed duplicates, sensitive content, and incomplete entries.

Each prompt is labeled independently by three trained annotators following a standardized guideline, which yields four label fields: three annotator labels and one majority-vote final label. The dataset includes three risk categories:
- Benign: normal user intent
- Suspicious: ambiguous or potentially harmful intent
- Malicious: explicit attempts to bypass safety mechanisms or manipulate system behavior

Each instance also contains an attack_type field identifying the adversarial strategy, such as meta-querying, instruction manipulation, role exploitation, or system-prompt inversion. In addition, every entry includes a short rationale explaining the annotator's decision, which supports research in explainable AI and human–model alignment.

The dataset is stored in a single CSV file with structured columns covering prompt text, annotation labels, metadata, annotator rationales, and reviewer identifiers. This format suits supervised classification, adversarial robustness analysis, annotation reliability studies, and the development of Human-in-the-Loop (HITL) safety systems.
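The majority-vote final label described above can be illustrated with a minimal Python sketch. The listing does not give the CSV's actual column names, so `annotator_1`, `annotator_2`, and `annotator_3` are hypothetical, as are the sample rows. The listing also does not say how a three-way disagreement is resolved, so returning `None` in that case is an assumption of this sketch, not the dataset authors' rule.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators.

    Returns None when no label has a strict majority (for example, when
    three annotators pick three different categories). How the dataset
    itself handles that case is not stated; None is an assumption here.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Hypothetical rows; the column names are illustrative, not taken from the CSV.
rows = [
    {"annotator_1": "Benign", "annotator_2": "Benign", "annotator_3": "Suspicious"},
    {"annotator_1": "Malicious", "annotator_2": "Malicious", "annotator_3": "Malicious"},
    {"annotator_1": "Benign", "annotator_2": "Suspicious", "annotator_3": "Malicious"},
]

annotator_columns = ("annotator_1", "annotator_2", "annotator_3")
final_labels = [
    majority_label([row[column] for column in annotator_columns]) for row in rows
]
```

Recomputing the vote this way and comparing it with the published final label is one simple consistency check for annotation reliability studies. Rows where the function returns `None` would need whatever tie-breaking rule the dataset documentation specifies.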
Created:
2025-12-09



