mhj
ModelScope community. Updated 2026-01-08; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/ScaleAI/mhj
Resource description:
<style>
button {
/* margin: calc(20vw / 100); */
margin: 0.5em;
padding-left: calc(40vw / 100);
padding-right: calc(40vw / 100);
padding-bottom: calc(0vw / 100);
text-align: center;
font-size: 12px;
height: 25px;
transition: 0.5s;
background-size: 200% auto;
color: white;
border-radius: calc(60vw / 100);
display: inline;
/* border: 2px solid black; */
font-weight: 500;
box-shadow: 0px 0px 14px -7px #f09819;
background-image: linear-gradient(45deg, #64F 0%, #000000 51%, #FF512F 100%);
cursor: pointer;
user-select: none;
-webkit-user-select: none;
touch-action: manipulation;
}
button:hover {
background-position: right center;
color: #fff;
text-decoration: none;
}
button:active {
transform: scale(0.95);
}
</style>
# Dataset Card for MHJ (08/27/2024)
<a href="https://scale.com/research/mhj" style="text-decoration:none">
<button>Homepage</button>
</a>
<a href="https://huggingface.co/ScaleAI/mhj-llama3-8b-rmu" style="text-decoration:none">
<button>Llama3-8B-RMU</button>
</a>
This card contains datasets for the paper *LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet*.
We compile successful jailbreaks into the Multi-Turn Human Jailbreaks
(MHJ) dataset, consisting of 2,912 prompts across 537 multi-turn conversations. We include
relevant metadata for each submission, including design choice comments from each red teamer for their
jailbreak. The resulting attack success rate (ASR) of our human red teaming is shown as follows.

\*[CYGNET](https://www.grayswan.ai/) is closed
source, hence results for AutoDAN, GCG, and PAIR are cited from the original papers and should
not be directly compared against human ASR.
Due to potential export control concerns, we omitted some of the HarmBench behaviors and redacted some of the completions. This research covers the HarmBench test set, including both standard and contextual behaviors. For the same reason, we omitted the responses for WMDP-bio behaviors (to download the same WMDP-bio set we use, see https://huggingface.co/datasets/ScaleAI/mhj-wmdp-bio).
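As a rough illustration of the attack success rate (ASR) metric mentioned above, the following minimal sketch counts a behavior as broken if at least one attempt against it succeeded. That counting rule, the function name `attack_success_rate`, and the toy records are assumptions made for illustration; they are not taken from the paper's evaluation pipeline or from the dataset's actual schema. The only figures drawn from this card are the 2,912 prompts and 537 conversations.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return the fraction of behaviors with at least one successful attempt.

    `outcomes` is an iterable of (behavior_id, succeeded) pairs.
    Assumption: a behavior counts as jailbroken if any attempt succeeded.
    """
    broken = defaultdict(bool)
    for behavior_id, succeeded in outcomes:
        broken[behavior_id] = broken[behavior_id] or bool(succeeded)
    if not broken:
        raise ValueError("no outcomes given")
    return sum(broken.values()) / len(broken)


# Hypothetical toy records, not real MHJ data.
toy_outcomes = [
    ("b1", False),
    ("b1", True),   # second attempt on b1 succeeds
    ("b2", False),
    ("b3", True),
    ("b4", False),
]
asr = attack_success_rate(toy_outcomes)  # 2 of 4 behaviors broken

# Dataset-level figures stated in this card.
PROMPTS = 2912
CONVERSATIONS = 537
avg_prompts = PROMPTS / CONVERSATIONS

print(f"toy ASR: {asr:.2f}")
print(f"average prompts per conversation: {avg_prompts:.2f}")
```

On the card's own figures, the average works out to roughly 5.4 prompts per conversation.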
## Paper Abstract
Recent large language model (LLM) defenses have greatly improved models’ ability to refuse harmful
queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against
automated adversarial attacks in a single turn of conversation, an insufficient threat model for
real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities,
exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs
with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning
defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile
these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn
jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens
of commercial red teaming engagements, supporting research towards stronger LLM defenses.
## Bibtex Citation
If you use this dataset, please consider citing:
```
@misc{li2024llmdefensesrobustmultiturn,
title={LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet},
author={Nathaniel Li and Ziwen Han and Ian Steneker and Willow Primack and Riley Goodside and Hugh Zhang and Zifan Wang and Cristina Menghini and Summer Yue},
year={2024},
eprint={2408.15221},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2408.15221},
}
```
Provider:
maas
Created:
2025-09-23



