SafeMTData

Name: SafeMTData
Creator: maas
Published: 2026-05-12 18:39:53
License: No description available

ModelScope community; updated 2026-05-12; indexed 2024-10-19

Download link:

https://modelscope.cn/datasets/Shanghai_AI_Laboratory/SafeMTData


Resource description:

# 💥Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

[**🌐 GitHub**](https://github.com/renqibing/ActorAttack) | [**🛎 Paper**](https://arxiv.org/abs/2410.10700)

## If you like our project, please give us a star ⭐ on Hugging Face for the latest updates.

## 📰 News

| Date | Event |
|------------|----------|
| **2024/10/14** | 🔥 We have released our dataset and posted our paper on arXiv. |

## 📥 Using our dataset via Hugging Face Datasets

```python
from datasets import load_dataset

# Each configuration is loaded by name and indexed by the split of the same name.
Attack_600 = load_dataset("SafeMTData/SafeMTData", 'Attack_600')["Attack_600"]
SafeMTData_1K = load_dataset("SafeMTData/SafeMTData", 'SafeMTData_1K')["SafeMTData_1K"]
```

## 🚀 Dataset Overview

**Attack_600**: Attack_600 is a collection of 600 harmful multi-turn queries designed to expose alignment vulnerabilities of LLMs in multi-turn interactions. It expands 200 single-turn malicious queries from [HarmBench](https://www.harmbench.org) into 600 multi-turn queries, with 3 different attack paths for each original query.

**For reproducibility**: Attack_600 is the initial multi-turn jailbreak data generated by ActorAttack without using the responses of the victim model. During an actual attack, however, ActorAttack dynamically modifies the initial jailbreak queries based on the victim model's responses. To fully evaluate the effectiveness of ActorAttack when jailbreaking a model, you therefore need to apply [dynamic modification](https://github.com/renqibing/ActorAttack/blob/master/inattack.py) to Attack_600. For example, you can pass Attack_600 through the `pre_attack_data_path` parameter in [`in_attack_config`](https://github.com/renqibing/ActorAttack/blob/eabe0692665ab0a3e9fa1c03d2cd2f8acc8b2358/main.py#L25).

**SafeMTData_1K**: SafeMTData_1K targets the safety alignment of LLMs in multi-turn interactions and consists of 1680 safe multi-turn dialogues.
This dataset is curated from the [circuit breaker](https://github.com/GraySwanAI/circuit-breakers/blob/main/data/circuit_breakers_train.json) training dataset, which has been filtered to avoid data contamination with HarmBench. It contains harmful multi-turn queries created by ActorAttack together with refusal responses that reject those queries.

## 😃 Dataset Details

The SafeMTData_1K dataset comprises the following columns:

- **id**: Unique identifier for all samples.
- **query_id**: Identifier for every malicious query.
- **actor_name**: Name of the actor that is semantically linked to the harmful query and serves as the attack clue. Each query can involve **multiple actors**.
- **relationship**: The relationship between the actor and the harmful query.
- **plain_query**: The original plain malicious query.
- **conversations**: Our safe multi-turn conversations, containing harmful multi-turn queries and **refusal responses** that reject them.

The Attack_600 dataset comprises the following columns:

- **id**: Unique identifier for all samples.
- **query_id**: Identifier for every malicious query.
- **category**: The harmful category to which the query belongs.
- **actor_name**: Name of the actor that is semantically linked to the harmful query and serves as the attack clue. Each query can involve **multiple actors**.
- **relationship**: The relationship between the actor and the harmful query.
- **plain_query**: The original plain malicious query.
- **multi_turn_queries**: The multi-turn jailbreak queries generated by **ActorAttack**.

## 🏆 Mini-Leaderboard

| Model | Attack Success Rate (%) |
|----------------------------|:---------:|
| GPT-3.5-turbo-1106 | 78.5 |
| GPT-4o | 84.5 |
| Claude-3.5-sonnet | 66.5 |
| Llama-3-8B-instruct | 79.0 |
| Llama-3-70B | 85.5 |
| **GPT-o1** | **60.0** |

## ❌ Disclaimers

This dataset contains offensive content that may be disturbing. This benchmark is provided for educational and research purposes only.
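The column list in Dataset Details and the "3 attack paths per original query" statement in the overview suggest a simple structural check: group rows by `query_id` and count them. The following is a minimal sketch of that check, not part of the authors' tooling. The helper names `group_by_query` and `paths_per_query` are hypothetical, the sample rows are invented benign placeholders, and treating `multi_turn_queries` as a list of strings is an assumption about the schema.

```python
from collections import defaultdict


def group_by_query(records):
    """Map each query_id to the list of records that share it."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["query_id"]].append(record)
    return dict(grouped)


def paths_per_query(records):
    """Count how many attack paths (rows) each original query has."""
    return {qid: len(rows) for qid, rows in group_by_query(records).items()}


# Hypothetical placeholder rows: the keys follow the Attack_600 column list,
# but every value is invented for illustration and is benign.
sample_rows = [
    {
        "id": i,
        "query_id": i // 3,  # three consecutive rows share one query
        "category": "placeholder",
        "actor_name": f"actor_{i}",
        "relationship": "placeholder",
        "plain_query": f"query_{i // 3}",
        "multi_turn_queries": ["turn 1", "turn 2"],  # assumed: list of strings
    }
    for i in range(6)
]

counts = paths_per_query(sample_rows)
print(counts)
```

With the real data, `paths_per_query(Attack_600)` should work unchanged, since iterating over a `datasets.Dataset` yields one dict per row. If the overview's description holds, it would report 200 query IDs with 3 paths each.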
## 📲 Contact

- Qibing Ren: renqibing@sjtu.edu.cn
- Hao Li: zy2442214@buaa.edu.cn

## 📖 BibTeX

```bibtex
@misc{ren2024derailyourselfmultiturnllm,
  title={Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues},
  author={Qibing Ren and Hao Li and Dongrui Liu and Zhanxu Xie and Xiaoya Lu and Yu Qiao and Lei Sha and Junchi Yan and Lizhuang Ma and Jing Shao},
  year={2024},
  eprint={2410.10700},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.10700},
}
```


Provider: maas

Created: 2024-10-16
