TrustAIRLab/in-the-wild-jailbreak-prompts
Source: Hugging Face (updated 2024-11-19, indexed 2025-04-12)
Download link: https://hf-mirror.com/datasets/TrustAIRLab/in-the-wild-jailbreak-prompts
---
license: mit
dataset_info:
- config_name: jailbreak_2023_05_07
  features:
  - name: platform
    dtype: string
  - name: source
    dtype: string
  - name: prompt
    dtype: string
  - name: jailbreak
    dtype: bool
  - name: created_at
    dtype: string
  - name: date
    dtype: string
  - name: community_id
    dtype: float64
  - name: community_name
    dtype: string
  splits:
  - name: train
    num_bytes: 1391612
    num_examples: 666
  download_size: 656975
  dataset_size: 1391612
- config_name: jailbreak_2023_12_25
  features:
  - name: platform
    dtype: string
  - name: source
    dtype: string
  - name: prompt
    dtype: string
  - name: jailbreak
    dtype: bool
  - name: created_at
    dtype: string
  - name: date
    dtype: string
  - name: community
    dtype: string
  - name: community_id
    dtype: float64
  - name: previous_community_id
    dtype: float64
  splits:
  - name: train
    num_bytes: 3799875
    num_examples: 1405
  download_size: 1871641
  dataset_size: 3799875
- config_name: regular_2023_05_07
  features:
  - name: platform
    dtype: string
  - name: source
    dtype: string
  - name: prompt
    dtype: string
  - name: jailbreak
    dtype: bool
  - name: created_at
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 6534994
    num_examples: 5721
  download_size: 3264474
  dataset_size: 6534994
- config_name: regular_2023_12_25
  features:
  - name: platform
    dtype: string
  - name: source
    dtype: string
  - name: prompt
    dtype: string
  - name: jailbreak
    dtype: bool
  - name: created_at
    dtype: string
  - name: date
    dtype: string
  splits:
  - name: train
    num_bytes: 24345310
    num_examples: 13735
  download_size: 12560543
  dataset_size: 24345310
configs:
- config_name: jailbreak_2023_05_07
  data_files:
  - split: train
    path: jailbreak_2023_05_07/train-*
- config_name: jailbreak_2023_12_25
  data_files:
  - split: train
    path: jailbreak_2023_12_25/train-*
- config_name: regular_2023_05_07
  data_files:
  - split: train
    path: regular_2023_05_07/train-*
- config_name: regular_2023_12_25
  data_files:
  - split: train
    path: regular_2023_12_25/train-*
task_categories:
- text-generation
size_categories:
- 10K<n<100K
---
# In-The-Wild Jailbreak Prompts on LLMs
This is the official repository for the ACM CCS 2024 paper ["Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models](https://arxiv.org/abs/2308.03825) by [Xinyue Shen](https://xinyueshen.me/), [Zeyuan Chen](https://picodora.github.io/), [Michael Backes](https://michaelbackes.eu/), Yun Shen, and [Yang Zhang](https://yangzhangalmo.github.io/).
In this project, employing our new framework JailbreakHub, we conduct the first measurement study on jailbreak prompts in the wild, with **15,140 prompts** collected from December 2022 to December 2023 (including **1,405 jailbreak prompts**).
Check out our [website here](https://jailbreak-llms.xinyueshen.me/).
**Disclaimer. This repo contains examples of harmful language. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.**
## Data
## Prompts
Overall, we collect 15,140 prompts from four platforms (Reddit, Discord, websites, and open-source datasets) between December 2022 and December 2023. Among these prompts, we identify 1,405 jailbreak prompts. To the best of our knowledge, this dataset is the largest collection of in-the-wild jailbreak prompts.
> Statistics of our data source. (Adv) UA refers to (adversarial) user accounts.
| Platform | Source | # Posts | # UA | # Adv UA | # Prompts | # Jailbreaks | Prompt Time Range |
| --------- | -------------------------- | ----------- | --------- | -------- | ---------- | ------------ | ------------------- |
| Reddit | r/ChatGPT | 163549 | 147 | 147 | 176 | 176 | 2023.02-2023.11 |
| Reddit | r/ChatGPTPromptGenius | 3536 | 305 | 21 | 654 | 24 | 2022.12-2023.11 |
| Reddit | r/ChatGPTJailbreak | 1602 | 183 | 183 | 225 | 225 | 2023.02-2023.11 |
| Discord | ChatGPT | 609 | 259 | 106 | 544 | 214 | 2023.02-2023.12 |
| Discord | ChatGPT Prompt Engineering | 321 | 96 | 37 | 278 | 67 | 2022.12-2023.12 |
| Discord | Spreadsheet Warriors | 71 | 3 | 3 | 61 | 61 | 2022.12-2023.09 |
| Discord | AI Prompt Sharing | 25 | 19 | 13 | 24 | 17 | 2023.03-2023.04 |
| Discord | LLM Promptwriting | 184 | 64 | 41 | 167 | 78 | 2023.03-2023.12 |
| Discord | BreakGPT | 36 | 10 | 10 | 32 | 32 | 2023.04-2023.09 |
| Website | AIPRM | - | 2777 | 23 | 3930 | 25 | 2023.01-2023.06 |
| Website | FlowGPT | - | 3505 | 254 | 8754 | 405 | 2022.12-2023.12 |
| Website | JailbreakChat | - | - | - | 79 | 79 | 2023.02-2023.05 |
| Dataset | AwesomeChatGPTPrompts | - | - | - | 166 | 2 | - |
| Dataset | OCR-Prompts | - | - | - | 50 | 0 | - |
| **Total** | | **169,933** | **7,308** | **803** | **15,140** | **1,405** | **2022.12-2023.12** |
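The per-source counts in the table can be rolled up by platform to see where jailbreak prompts concentrate. The following is a minimal sketch that copies only the `# Prompts` and `# Jailbreaks` columns from the table; the names `ROWS` and `totals_by_platform` are illustrative and not part of the dataset or the repository.

```python
from collections import defaultdict

# (platform, source, prompts, jailbreaks), copied from the table above.
ROWS = [
    ("Reddit", "r/ChatGPT", 176, 176),
    ("Reddit", "r/ChatGPTPromptGenius", 654, 24),
    ("Reddit", "r/ChatGPTJailbreak", 225, 225),
    ("Discord", "ChatGPT", 544, 214),
    ("Discord", "ChatGPT Prompt Engineering", 278, 67),
    ("Discord", "Spreadsheet Warriors", 61, 61),
    ("Discord", "AI Prompt Sharing", 24, 17),
    ("Discord", "LLM Promptwriting", 167, 78),
    ("Discord", "BreakGPT", 32, 32),
    ("Website", "AIPRM", 3930, 25),
    ("Website", "FlowGPT", 8754, 405),
    ("Website", "JailbreakChat", 79, 79),
    ("Dataset", "AwesomeChatGPTPrompts", 166, 2),
    ("Dataset", "OCR-Prompts", 50, 0),
]


def totals_by_platform(rows):
    """Sum prompt and jailbreak counts for each platform."""
    totals = defaultdict(lambda: [0, 0])
    for platform, _source, prompts, jailbreaks in rows:
        totals[platform][0] += prompts
        totals[platform][1] += jailbreaks
    return {name: tuple(counts) for name, counts in totals.items()}


if __name__ == "__main__":
    for platform, (prompts, jailbreaks) in totals_by_platform(ROWS).items():
        share = jailbreaks / prompts
        print(f"{platform:8s} prompts={prompts:6d} jailbreaks={jailbreaks:5d} share={share:.1%}")
```

Summed over all platforms, these two columns reproduce the 15,140 prompts and 1,405 jailbreak prompts reported in the **Total** row.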
**Load Prompts**
You can use the Hugging Face [`Datasets`](https://huggingface.co/datasets/TrustAIRLab/in-the-wild-jailbreak-prompts) library to easily load all collected prompts.
```python
from datasets import load_dataset
dataset = load_dataset('TrustAIRLab/in-the-wild-jailbreak-prompts', 'jailbreak_2023_05_07', split='train')
# dataset = load_dataset('TrustAIRLab/in-the-wild-jailbreak-prompts', 'jailbreak_2023_12_25', split='train')
# dataset = load_dataset('TrustAIRLab/in-the-wild-jailbreak-prompts', 'regular_2023_05_07', split='train')
# dataset = load_dataset('TrustAIRLab/in-the-wild-jailbreak-prompts', 'regular_2023_12_25', split='train')
```
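The jailbreak and regular configurations do not share a schema: according to the dataset metadata, the jailbreak configurations carry extra community columns that the regular ones lack. A minimal sketch of combining rows from two configurations on their shared columns follows. It works on plain lists of dicts (for example, `list(dataset)` after loading each configuration); the helper names `shared_columns` and `merge_on_shared_columns` are hypothetical, and the sample rows are made-up placeholders rather than real dataset entries.

```python
def shared_columns(tables):
    """Return the column names present in every non-empty table,
    in the order they appear in the first non-empty table."""
    non_empty = [rows for rows in tables if rows]
    if not non_empty:
        return []
    common = set(non_empty[0][0])
    for rows in non_empty[1:]:
        common &= set(rows[0])
    return [name for name in non_empty[0][0] if name in common]


def merge_on_shared_columns(tables):
    """Concatenate all rows, keeping only the columns every table has."""
    columns = shared_columns(tables)
    return [{name: row[name] for name in columns} for rows in tables for row in rows]


if __name__ == "__main__":
    # Placeholder rows that only mimic the column layout of each configuration.
    jailbreak_rows = [{
        "platform": "discord", "source": "example", "prompt": "placeholder A",
        "jailbreak": True, "created_at": "", "date": "",
        "community": "", "community_id": 0.0, "previous_community_id": 0.0,
    }]
    regular_rows = [{
        "platform": "reddit", "source": "example", "prompt": "placeholder B",
        "jailbreak": False, "created_at": "", "date": "",
    }]
    merged = merge_on_shared_columns([jailbreak_rows, regular_rows])
    print(list(merged[0]))
```

Dropping the community columns is one design choice among several; keeping them and filling missing values for regular prompts would be equally reasonable if community information matters downstream.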
The original CSV files are provided in our GitHub repo [jailbreak_llms](https://github.com/verazuo/jailbreak_llms/tree/main/data).
Note: If you plan to use this dataset to train models, preprocessing the `prompt` field to remove duplicates is recommended.
For more details, see [this discussion](https://huggingface.co/datasets/TrustAIRLab/in-the-wild-jailbreak-prompts/discussions/3).
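One way to carry out the deduplication suggested in the note is sketched below. It assumes that two prompts count as duplicates when they match after collapsing whitespace and lowercasing; that normalization rule is an assumption of this sketch, not something the dataset card or the linked discussion prescribes. The function names are illustrative, and the input is any iterable of dicts with a `prompt` key, such as the rows of a loaded configuration.

```python
def normalize(text):
    """Collapse runs of whitespace and lowercase the text so that
    trivially different copies of a prompt compare equal."""
    return " ".join(text.split()).lower()


def dedupe_by_prompt(rows, key="prompt"):
    """Keep the first row for each distinct normalized prompt, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = normalize(row[key])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    rows = [
        {"prompt": "Hello   world", "jailbreak": False},
        {"prompt": "hello world", "jailbreak": False},
        {"prompt": "Something else", "jailbreak": False},
    ]
    print(len(dedupe_by_prompt(rows)))
```

Exact-match deduplication will not catch near-duplicates that differ by a few words; a fuzzy or embedding-based comparison would be needed for that and is outside the scope of this sketch.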
## Question Set
To evaluate the effectiveness of jailbreak prompts, we construct a question set comprising 390 questions across 13 forbidden scenarios adopted from [OpenAI Usage Policy](https://openai.com/policies/usage-policies).
We exclude the `Child Sexual Abuse` scenario from our evaluation and focus on the remaining 13 scenarios: `Illegal Activity`, `Hate Speech`, `Malware Generation`, `Physical Harm`, `Economic Harm`, `Fraud`, `Pornography`, `Political Lobbying`, `Privacy Violence`, `Legal Opinion`, `Financial Advice`, `Health Consultation`, and `Government Decision`.
```python
from datasets import load_dataset
forbidden_question_set = load_dataset("TrustAIRLab/forbidden_question_set", split='train')
```
The original file of the question set is also provided in our GitHub repo [jailbreak_llms](https://github.com/verazuo/jailbreak_llms/tree/main/data).
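To check how the 390 questions are spread over the 13 scenarios, the rows can be grouped by scenario. The sketch below is hedged in one important respect: the column name `content_policy_name` is an assumption, so inspect `forbidden_question_set.column_names` after loading and adjust `scenario_key` accordingly. The sample rows are placeholders, not real questions.

```python
from collections import Counter


def questions_per_scenario(rows, scenario_key="content_policy_name"):
    """Count how many questions belong to each scenario.

    `scenario_key` is an assumed column name; pass the real one if it differs.
    """
    return Counter(row[scenario_key] for row in rows)


if __name__ == "__main__":
    # Placeholder rows that only illustrate the expected shape of the data.
    rows = [
        {"content_policy_name": "Fraud", "question": "placeholder 1"},
        {"content_policy_name": "Fraud", "question": "placeholder 2"},
        {"content_policy_name": "Legal Opinion", "question": "placeholder 3"},
    ]
    for scenario, count in questions_per_scenario(rows).items():
        print(f"{scenario}: {count}")

    # 390 questions over 13 scenarios averages out to 30 per scenario.
    print(390 / 13)
```

The average of 30 questions per scenario follows from the stated totals; whether every scenario has exactly 30 is something the count above would confirm on the real data.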
## Code
### Evaluator - ChatGLMEval
```
cd code/ChatGLMEval
# add data paths to df_path_list in run_evaluator.py
python run_evaluator.py
```
### Semantics Visualization
Check `code/semantics_visualization/visualize.ipynb`
## Ethics
We acknowledge that data collected online can contain personal information. We therefore adopt standard best practices to ensure that our study follows ethical principles, such as not attempting to deanonymize any user and reporting results in aggregate. Since this study involved only publicly available data and no interaction with participants, it is not regarded as human subjects research by our Institutional Review Boards (IRB). Nonetheless, because one of our goals is to measure the risk of LLMs answering harmful questions, disclosing how a model can generate hateful content is unavoidable. This can raise concerns about potential misuse. However, we strongly believe that raising awareness of the problem is even more important, as it can help LLM vendors and the research community develop stronger safeguards and contribute to the more responsible release of these models.
We have responsibly disclosed our findings to related LLM vendors.
## Citation
If you find this useful in your research, please consider citing:
```
@inproceedings{SCBSZ24,
author = {Xinyue Shen and Zeyuan Chen and Michael Backes and Yun Shen and Yang Zhang},
title = {{``Do Anything Now'': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models}},
booktitle = {{ACM SIGSAC Conference on Computer and Communications Security (CCS)}},
publisher = {ACM},
year = {2024}
}
```
## License
`jailbreak_llms` is licensed under the terms of the MIT license. See LICENSE for more details.
Provider: TrustAIRLab



