SafeMTData/SafeMTData
Source: Hugging Face. Updated 2024-11-21, indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/SafeMTData/SafeMTData
Resource description:
---
license: mit
task_categories:
- text-generation
- question-answering
pretty_name: SafeMTData
size_categories:
- 1K<n<10K
configs:
- config_name: SafeMTData_1K
  data_files:
  - split: SafeMTData_1K
    path: SafeMTData/SafeMTData_1K.json
- config_name: Attack_600
  data_files:
  - split: Attack_600
    path: SafeMTData/Attack_600.json
---
# 💥Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
[**🌐 GitHub**](https://github.com/renqibing/ActorAttack) | [**🛎 Paper**](https://arxiv.org/abs/2410.10700)
## If you like our project, please give us a star ⭐ on Hugging Face to follow the latest updates.
## 📰 News
| Date | Event |
|------------|----------|
| **2024/10/14** | 🔥 We have released our dataset and posted our paper on arXiv. |
## 📥 Using our dataset via Hugging Face Datasets
```python
from datasets import load_dataset

# Each config exposes a single split that carries the same name as the config.
Attack_600 = load_dataset("SafeMTData/SafeMTData", 'Attack_600')["Attack_600"]
SafeMTData_1K = load_dataset("SafeMTData/SafeMTData", 'SafeMTData_1K')["SafeMTData_1K"]
```
## 🚀 Dataset Overview
**Attack_600**: Attack_600 is a collection of 600 harmful multi-turn queries designed to expose alignment vulnerabilities of LLMs in multi-turn interactions. It expands 200 single-turn malicious queries from [HarmBench](https://www.harmbench.org) into 600 multi-turn queries, with 3 different attack paths for each original query.
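The 200 × 3 = 600 structure can be checked by grouping rows on `query_id`. The sketch below assumes that each row of Attack_600 holds one attack path and that rows derived from the same original query share a `query_id` (the column is described under Dataset Details). The helper name `paths_per_query` and the toy rows are illustrative placeholders, not part of the dataset or its tooling.

```python
from collections import Counter
from typing import Iterable, Mapping


def paths_per_query(rows: Iterable[Mapping]) -> Counter:
    """Count how many rows (attack paths) share each query_id."""
    return Counter(row["query_id"] for row in rows)


# Toy rows standing in for Attack_600 records (placeholders, not real data).
toy_rows = [
    {"id": 0, "query_id": 0},
    {"id": 1, "query_id": 0},
    {"id": 2, "query_id": 0},
    {"id": 3, "query_id": 1},
    {"id": 4, "query_id": 1},
    {"id": 5, "query_id": 1},
]

counts = paths_per_query(toy_rows)
# True when every original query has exactly three attack paths.
all_have_three = all(n == 3 for n in counts.values())
print(len(counts), all_have_three)
```

On the real data, the loaded `Attack_600` split could be passed in place of `toy_rows`; if the assumption above holds, `len(counts)` would be expected to equal 200.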
**For reproducibility**: Attack_600 is the initial multi-turn jailbreak data generated by ActorAttack without using the responses of the victim model. During an actual attack, however, ActorAttack dynamically modifies these initial queries based on the victim model's responses. To fully evaluate the effectiveness of ActorAttack when jailbreaking a model, you therefore need to apply [dynamic modification](https://github.com/renqibing/ActorAttack/blob/master/inattack.py) to Attack_600. For example, you can pass Attack_600 through the `pre_attack_data_path` parameter in the [`in_attack_config`](https://github.com/renqibing/ActorAttack/blob/eabe0692665ab0a3e9fa1c03d2cd2f8acc8b2358/main.py#L25).
**SafeMTData_1K**: SafeMTData_1K targets the safety alignment of LLMs in multi-turn interactions and consists of 1680 safe multi-turn dialogues. It is curated from the [circuit breaker](https://github.com/GraySwanAI/circuit-breakers/blob/main/data/circuit_breakers_train.json) training dataset, which has been filtered to avoid data contamination with HarmBench. It contains harmful multi-turn queries created by ActorAttack together with refusal responses that reject those queries.
## 😃 Dataset Details
The SafeMTData_1K dataset comprises the following columns:
- **id**: Unique identifier for all samples.
- **query_id**: Identifier for every malicious query.
- **actor_name**: Name of the actor that is semantically linked to the harmful query and serves as the attack clue. Each query can involve **multiple actors**.
- **relationship**: The relationship between the actor and the harmful query.
- **plain_query**: The original plain malicious query.
- **conversations**: Our safe multi-turn conversations, containing harmful multi-turn queries and **refusal responses** to reject the harmful queries.
The Attack_600 dataset comprises the following columns:
- **id**: Unique identifier for all samples.
- **query_id**: Identifier for every malicious query.
- **category**: The harmful category to which the query belongs.
- **actor_name**: Name of the actor that is semantically linked to the harmful query and serves as the attack clue. Each query can involve **multiple actors**.
- **relationship**: The relationship between the actor and the harmful query.
- **plain_query**: The original plain malicious query.
- **multi_turn_queries**: The multi-turn jailbreak queries generated by **ActorAttack**.
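As a quick sanity check after loading, a record can be compared against the column names listed above. This is a minimal sketch: the constant names and the helper `missing_columns` are hypothetical, and the toy record uses dummy values rather than real dataset content.

```python
from typing import Mapping

# Column names as listed in this card.
SAFEMTDATA_1K_COLUMNS = {
    "id", "query_id", "actor_name", "relationship",
    "plain_query", "conversations",
}
ATTACK_600_COLUMNS = {
    "id", "query_id", "category", "actor_name", "relationship",
    "plain_query", "multi_turn_queries",
}


def missing_columns(record: Mapping, expected: set) -> set:
    """Return the expected column names that are absent from a record."""
    return expected - set(record.keys())


# Placeholder record with the Attack_600 columns (values are dummies).
toy_record = {name: None for name in ATTACK_600_COLUMNS}
print(missing_columns(toy_record, ATTACK_600_COLUMNS))     # set()
print(missing_columns(toy_record, SAFEMTDATA_1K_COLUMNS))  # {'conversations'}
```

With the real data, the first record of a loaded split (for example `Attack_600[0]`) could be passed as `record`; an empty result would indicate that all documented columns are present.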
## 🏆 Mini-Leaderboard
| Model | Attack Success Rate (%) |
|----------------------------|:---------:|
| GPT-3.5-turbo-1106 | 78.5 |
| GPT-4o | 84.5 |
| Claude-3.5-sonnet | 66.5 |
| Llama-3-8B-instruct | 79.0 |
| Llama-3-70B | 85.5 |
| **GPT-o1** | **60.0** |
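The card does not spell out how the percentages are computed. A common reading of attack success rate is the share of attack attempts judged successful, which the sketch below illustrates. It assumes one boolean judgement per query; the function name and the sample judgements are hypothetical and are not results from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of attempts judged successful; outcomes must be non-empty."""
    if not outcomes:
        raise ValueError("outcomes must contain at least one judgement")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical judgements for four queries (not results from the paper).
print(attack_success_rate([True, False, True, True]))  # 75.0
```

Under this reading, a lower value indicates a model that resisted more of the attacks.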
## ❌ Disclaimers
This dataset contains offensive content that may be disturbing. This benchmark is provided for educational and research purposes only.
## 📲 Contact
- Qibing Ren: renqibing@sjtu.edu.cn
- Hao Li: zy2442214@buaa.edu.cn
## 📖 BibTeX:
```bibtex
@misc{ren2024derailyourselfmultiturnllm,
  title={Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues},
  author={Qibing Ren and Hao Li and Dongrui Liu and Zhanxu Xie and Xiaoya Lu and Yu Qiao and Lei Sha and Junchi Yan and Lizhuang Ma and Jing Shao},
  year={2024},
  eprint={2410.10700},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.10700},
}
```
Provider:
SafeMTData



