SafeMTData/SafeMTData

Name: SafeMTData/SafeMTData
Creator: SafeMTData
Published: 2024-11-21 04:40:40
License: MIT

Source: Hugging Face. Updated 2024-11-21; indexed 2025-04-26.

Download link:

https://hf-mirror.com/datasets/SafeMTData/SafeMTData


Description:

---
license: mit
task_categories:
- text-generation
- question-answering
pretty_name: SafeMTData
size_categories:
- 1K<n<10K
configs:
- config_name: SafeMTData_1K
  data_files:
  - split: SafeMTData_1K
    path: SafeMTData/SafeMTData_1K.json
- config_name: Attack_600
  data_files:
  - split: Attack_600
    path: SafeMTData/Attack_600.json
---

# 💥Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

[**🌐 GitHub**](https://github.com/renqibing/ActorAttack) | [**🛎 Paper**](https://arxiv.org/abs/2410.10700)

## If you like our project, please give us a star ⭐ on Hugging Face for the latest update.

## 📰 News

| Date | Event |
|------------|----------|
| **2024/10/14** | 🔥 We have released our dataset and posted our paper on arXiv. |

## 📥 Using our dataset via Hugging Face Datasets

```python
from datasets import load_dataset

# Each config exposes a single split named after the config itself.
Attack_600 = load_dataset("SafeMTData/SafeMTData", "Attack_600")["Attack_600"]
SafeMTData_1K = load_dataset("SafeMTData/SafeMTData", "SafeMTData_1K")["SafeMTData_1K"]
```

## 🚀 Dataset Overview

**Attack_600**: Attack_600 is a collection of 600 harmful multi-turn queries designed to identify alignment vulnerabilities of LLMs in multi-turn interactions. It expands 200 single-turn malicious queries from [Harmbench](https://www.harmbench.org) into 600 multi-turn queries, with 3 different attack paths for each original query.

**For reproducibility**: Attack_600 is the initial multi-turn jailbreak data generated by ActorAttack without exploiting the responses of the victim model. However, ActorAttack dynamically modifies the initial jailbreak queries based on the victim model's responses. Therefore, for reproducibility, when jailbreaking a model, you need to apply [dynamic modification](https://github.com/renqibing/ActorAttack/blob/master/inattack.py) to Attack_600 to fully evaluate the effectiveness of ActorAttack.
For example, you can pass Attack_600 through the `pre_attack_data_path` parameter in the [`in_attack_config`](https://github.com/renqibing/ActorAttack/blob/eabe0692665ab0a3e9fa1c03d2cd2f8acc8b2358/main.py#L25).

**SafeMTData_1K**: SafeMTData_1K targets the safety alignment of LLMs in multi-turn interactions and consists of 1680 safe multi-turn dialogues. It is curated from the [circuit breaker](https://github.com/GraySwanAI/circuit-breakers/blob/main/data/circuit_breakers_train.json) training dataset, which has been filtered to avoid data contamination with Harmbench. It contains harmful multi-turn queries created by ActorAttack, along with refusal responses that reject those queries.

## 😃 Dataset Details

The SafeMTData_1K dataset comprises the following columns:

- **id**: Unique identifier for each sample.
- **query_id**: Identifier for each malicious query.
- **actor_name**: Name of the actor that is semantically linked to the harmful query and serves as the attack clue. Each query can involve **multiple actors**.
- **relationship**: The relationship between the actor and the harmful query.
- **plain_query**: The original plain malicious query.
- **conversations**: Our safe multi-turn conversations, containing harmful multi-turn queries and **refusal responses** that reject the harmful queries.

The Attack_600 dataset comprises the following columns:

- **id**: Unique identifier for each sample.
- **query_id**: Identifier for each malicious query.
- **category**: The harmful category to which the query belongs.
- **actor_name**: Name of the actor that is semantically linked to the harmful query and serves as the attack clue. Each query can involve **multiple actors**.
- **relationship**: The relationship between the actor and the harmful query.
- **plain_query**: The original plain malicious query.
- **multi_turn_queries**: The multi-turn jailbreak queries generated by **ActorAttack**.
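
The Attack_600 columns listed above can be explored with plain Python once the split is loaded. The sketch below is only illustrative: it builds six placeholder records (not real dataset rows) that use the documented column names, and it assumes that `multi_turn_queries` is a list of strings, which the card does not state. `group_by_query` is a hypothetical helper, not part of the ActorAttack code.

```python
from collections import defaultdict

# Placeholder records that mimic the documented Attack_600 columns.
# All values are invented for illustration; real rows come from load_dataset(...).
sample_rows = [
    {
        "id": i,
        "query_id": i // 3,  # the card describes 3 attack paths per original query
        "category": "placeholder_category",
        "actor_name": f"placeholder_actor_{i}",
        "relationship": "placeholder relationship",
        "plain_query": f"<plain query {i // 3}>",
        # Assumption: a list of per-turn query strings.
        "multi_turn_queries": [f"<turn {t}>" for t in range(1, 4)],
    }
    for i in range(6)
]


def group_by_query(rows):
    """Group rows (attack paths) by the original query they expand."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["query_id"]].append(row)
    return dict(grouped)


grouped = group_by_query(sample_rows)

# Number of attack paths recorded for each original query.
paths_per_query = {qid: len(paths) for qid, paths in grouped.items()}

# Number of turns in each attack path.
turn_counts = [len(row["multi_turn_queries"]) for row in sample_rows]

print(paths_per_query)  # {0: 3, 1: 3}
print(turn_counts)      # [3, 3, 3, 3, 3, 3]
```

With the real data, the `Attack_600` split from the loading snippet could be passed to `group_by_query` in place of `sample_rows`, since iterating over a split yields one dictionary per row. A `query_id` with a path count other than 3 would suggest that the file differs from the description above.
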
## 🏆 Mini-Leaderboard

| Model | Attack Success Rate (%) |
|----------------------------|:---------:|
| GPT-3.5-turbo-1106 | 78.5 |
| GPT-4o | 84.5 |
| Claude-3.5-sonnet | 66.5 |
| Llama-3-8B-instruct | 79.0 |
| Llama-3-70B | 85.5 |
| **GPT-o1** | **60.0** |

## ❌ Disclaimers

This dataset contains offensive content that may be disturbing. This benchmark is provided for educational and research purposes only.

## 📲 Contact

- Qibing Ren: renqibing@sjtu.edu.cn
- Hao Li: zy2442214@buaa.edu.cn

## 📖 BibTeX:

```bibtex
@misc{ren2024derailyourselfmultiturnllm,
  title={Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues},
  author={Qibing Ren and Hao Li and Dongrui Liu and Zhanxu Xie and Xiaoya Lu and Yu Qiao and Lei Sha and Junchi Yan and Lizhuang Ma and Jing Shao},
  year={2024},
  eprint={2410.10700},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.10700},
}
```

Provided by:

SafeMTData
