PromptShield: A Human-Annotated Dataset for Prompt Injection Safety Classification
收藏Mendeley Data2026-04-09 收录
下载链接:
https://data.mendeley.com/datasets/n6wdvz5n3d/2
下载链接
链接失效反馈官方服务:
资源简介:
PromptShield is a human-annotated dataset designed to support research on prompt injection detection, adversarial prompting behavior, and safety classification in Large Language Models (LLMs). The dataset contains 3,000 cleaned prompt instances, collected from an original pool of 4,000 samples and refined through manual screening to remove duplicates, sensitive content, and incomplete entries. Each prompt is labeled independently by three trained annotators using a standardized guideline, resulting in four label fields: three annotator labels and one majority-vote final label. The dataset includes three risk categories: (Benign) normal user intent (Suspicious)ambiguous or potentially harmful intent (Malicious) explicit attempts to bypass safety mechanisms or manipulate system behavior Each instance also contains an attack_type field, identifying adversarial strategies such as meta-querying, instruction manipulation, role exploitation, or system-prompt inversion. Additionally, every entry includes a short rationale that explains the annotator’s decision, supporting research in explainable AI and human–model alignment. The dataset is stored in a single CSV file with structured columns covering prompt text, annotation labels, metadata, annotator rationales, and reviewer identifiers. This format enables broad utility in supervised classification, adversarial robustness analysis, annotation reliability studies, and the development of Human-in-the-Loop (HITL) safety systems. write latex format for mendealy data description
PromptShield是一个经人工标注的数据集,旨在支持大语言模型(Large Language Models, LLMs)的提示词注入检测、对抗提示行为与安全分类相关研究。该数据集包含3000条经过清洗的提示词实例,其原始样本池规模为4000条,经人工筛选剔除重复内容、敏感信息与不完整条目后得到最终样本。每条提示词均由三名经过培训的标注人员依据标准化标注指南独立完成标注,最终形成四个标签字段:三名标注者的个体标签,以及基于多数投票得出的最终标签。该数据集涵盖三类风险标签:(良性(Benign))正常用户意图、(可疑(Suspicious))模糊或潜在有害意图、(恶意(Malicious))蓄意绕过安全机制或操纵系统行为的尝试。每条实例还包含一个攻击类型(attack_type)字段,用于标识对抗性策略,例如元查询(meta-querying)、指令操纵、角色利用或系统提示词反转(system-prompt inversion)。此外,每条数据条目均附带一段简短的标注理由,用于阐释标注者的决策依据,可辅助可解释AI(explainable AI)与人机对齐领域的研究。该数据集以单个CSV文件存储,其结构化列字段涵盖提示词文本、标注标签、元数据、标注者理由以及审核者标识。该格式可广泛应用于监督分类、对抗鲁棒性分析、标注可靠性研究,以及人机在环(Human-in-the-Loop, HITL)安全系统的开发,同时可用于生成适配Mendeley的数据集描述的LaTeX格式。
提供机构:
United International University



