get-garak/input_classification

Name: get-garak/input_classification
Creator: get-garak
Published: 2025-12-15 18:04:57
License: No description available

Source: Hugging Face. Updated: 2025-12-15. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/get-garak/input_classification

Resource overview:

This dataset (`final_train.csv`) is a collection of **244,677 prompts** for training input guardrail models that detect jailbreak attempts, malicious prompts, and other harmful inputs to Large Language Models (LLMs). It combines prompts from four sources, each contributing different kinds of adversarial and benign examples. WildJailbreak is the largest contributor at 97.0% of the total and includes both adversarial jailbreak attempts and legitimate prompts. The other sources are Malicious LLM Prompts, Custom Dataset - my_csv, and Do-Not-Answer Dataset, each contributing different types of malicious and harmful prompts. The data is a single CSV file with columns for the prompt text, a binary label, and the source of each prompt. Intended uses are training input guardrail models, benchmark evaluation, red-team testing, and false-positive reduction.
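One way to sanity-check the stated totals and source shares after downloading the file is to count rows per label and per source. The sketch below uses only the Python standard library. The column names `label` and `source` are assumptions: the listing says only that the CSV has columns for prompt text, binary labels, and source tracking, so check the real header first.

```python
import csv
from collections import Counter


def summarize(csv_path, label_col="label", source_col="source"):
    """Count rows per label and per source in a prompt CSV.

    label_col and source_col are hypothetical column names; inspect the
    header of the real file before relying on them.
    """
    labels, sources = Counter(), Counter()
    # newline="" lets the csv module handle quoted multi-line prompts.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            labels[row[label_col]] += 1
            sources[row[source_col]] += 1
    total = sum(labels.values())
    return {
        "total": total,
        "label_share": {k: v / total for k, v in labels.items()} if total else {},
        "source_share": {k: v / total for k, v in sources.items()} if total else {},
    }
```

Calling `summarize("final_train.csv")` on a local copy would return the row count plus the fraction of rows per label value and per source, which can then be compared against the figures quoted in this listing.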

Provided by:

get-garak

Background overview

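The dataset is a comprehensive collection for training input guardrails for large language models. It contains 244,677 prompts and is intended for detecting jailbreak attempts and malicious inputs. It draws on four sources, with WildJailbreak dominant, and its labels are close to balanced (48.7% safe prompts, 51.3% malicious prompts). It covers a variety of adversarial techniques but is mostly limited to English-language content.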
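The quoted percentages can be converted into approximate row counts. This is a rough sketch: the shares are given to one decimal place, so each derived count is only accurate to about ±0.05% of the total (roughly 122 rows). The dictionary keys are illustrative labels, not column values from the file.

```python
TOTAL_PROMPTS = 244_677  # total stated in the listing

# Shares as stated in the listing, rounded to one decimal place.
shares = {"safe": 0.487, "malicious": 0.513, "WildJailbreak": 0.970}


def approx_count(total, share):
    """Approximate row count implied by a rounded share of the total."""
    return round(total * share)


approx = {name: approx_count(TOTAL_PROMPTS, s) for name, s in shares.items()}
```

Under these assumptions, the two label shares sum to the full total, and the three smaller sources together account for roughly the remaining 3% of rows.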

The content above was collected and summarized by 遇见数据集.
