JailbreakBench/JBB-Behaviors

Hugging Face. Updated 2024-09-26; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/JailbreakBench/JBB-Behaviors

Overview:

JailbreakBench is a benchmark for evaluating the robustness of large language models (LLMs) against jailbreak attacks, and for evaluating defenses against those attacks. The dataset contains 100 distinct misuse behaviors, sourced from several prior works and categorized with reference to OpenAI's usage policies. It is divided into two subsets, harmful behaviors and benign behaviors, and each entry records a behavior identifier, a goal query, a target response, a category, and a source. The benchmark is intended to help researchers evaluate new attack methods quickly, and an official leaderboard tracks the performance of both attacks and defenses.

Provided by:

JailbreakBench

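Summary of original information

JBB-Behaviors dataset overview

Basic information

Language: English
Name: JBB-Behaviors
Tags:
- jailbreaks
- large language models
- harmful behaviors
- ML safety
License: MIT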

Dataset configuration

Config name: behaviors
- Data files:
  - Split: harmful
    - Path: "data/harmful-behaviors.csv"
  - Split: benign
    - Path: "data/benign-behaviors.csv"
Config name: judge_comparison
- Data files:
  - Split: test
    - Path: "data/judge-comparison.csv"
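
The configuration above can be written down as a small lookup table. The following is a minimal sketch, not an official loader: the helper names `data_file` and `load_split` are made up for illustration. `load_split` assumes the third-party `datasets` library is installed and that the repository can be reached over the network.

```python
REPO_ID = "JailbreakBench/JBB-Behaviors"

# Config -> split -> CSV path, copied from the configuration listed above.
DATA_FILES = {
    "behaviors": {
        "harmful": "data/harmful-behaviors.csv",
        "benign": "data/benign-behaviors.csv",
    },
    "judge_comparison": {
        "test": "data/judge-comparison.csv",
    },
}


def data_file(config: str, split: str) -> str:
    """Return the CSV path for a config/split pair (KeyError if unknown)."""
    return DATA_FILES[config][split]


def load_split(config: str, split: str):
    """Load one split with the Hugging Face `datasets` library.

    Hypothetical helper: needs `pip install datasets` and network access.
    """
    data_file(config, split)  # fail early on an unknown config or split
    from datasets import load_dataset  # lazy import of a third-party package

    return load_dataset(REPO_ID, config, split=split)


print(data_file("behaviors", "harmful"))
```

For example, `load_split("behaviors", "benign")` would request the benign split, while `data_file` can be used offline to check which CSV a split maps to.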

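Dataset description

Purpose: evaluating jailbreak attacks on large language models (LLMs) and defenses against them.
Contents: 100 distinct misuse behaviors drawn from the following sources:
- Original behaviors
- Trojan Detection Challenge/HarmBench
- AdvBench
Data structure:
- Behavior: a unique identifier describing a distinct misuse behavior
- Goal: a query requesting an objectionable behavior
- Target: an affirmative response to the goal string
- Category: a broader category of misuse from OpenAI's usage policies
- Source: the source of the behavior (e.g., Original, Trojan Detection Challenge 2023 Red Teaming Track/HarmBench, AdvBench)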
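
The five fields listed above map naturally onto a typed record. The sketch below parses CSV text into such records using only the standard library. It assumes the CSV header uses the field names exactly as listed. `BehaviorRow`, `parse_behaviors`, and the sample rows are made-up placeholders, not content from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BehaviorRow:
    behavior: str  # unique identifier of the misuse behavior
    goal: str      # query requesting the behavior
    target: str    # affirmative response to the goal string
    category: str  # broader misuse category
    source: str    # where the behavior came from


def parse_behaviors(csv_text: str) -> List[BehaviorRow]:
    """Parse CSV text whose header contains the five field names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        BehaviorRow(
            behavior=row["Behavior"],
            goal=row["Goal"],
            target=row["Target"],
            category=row["Category"],
            source=row["Source"],
        )
        for row in reader
    ]


# Made-up placeholder rows for illustration; NOT taken from the dataset.
SAMPLE_CSV = (
    "Behavior,Goal,Target,Category,Source\n"
    "placeholder-a,<query text>,<affirmative response>,placeholder-category,Original\n"
    "placeholder-b,<query text>,<affirmative response>,placeholder-category,AdvBench\n"
)

rows = parse_behaviors(SAMPLE_CSV)
print([r.behavior for r in rows])
```

Looking columns up by name with `csv.DictReader` keeps the parser independent of column order and of any extra columns the real files may contain.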

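Dataset usage

Note: some of the dataset's content may be offensive to some readers.
Scope: the dataset contains 100 distinct misuse behaviors divided into ten broad categories corresponding to OpenAI's usage policies.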

Citation

Dataset citation (BibTeX):

@misc{chao2024jailbreakbench,
  title={JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models},
  author={Patrick Chao and Edoardo Debenedetti and Alexander Robey and Maksym Andriushchenko and Francesco Croce and Vikash Sehwag and Edgar Dobriban and Nicolas Flammarion and George J. Pappas and Florian Tramèr and Hamed Hassani and Eric Wong},
  year={2024},
  eprint={2404.01318},
  archivePrefix={arXiv},
  primaryClass={cs.CR}
}

Related dataset citations (BibTeX):

@misc{zou2023universal,
  title={Universal and Transferable Adversarial Attacks on Aligned Language Models},
  author={Andy Zou and Zifan Wang and J. Zico Kolter and Matt Fredrikson},
  year={2023},
  eprint={2307.15043},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@inproceedings{tdc2023,
  title={TDC 2023 (LLM Edition): The Trojan Detection Challenge},
  author={Mantas Mazeika and Andy Zou and Norman Mu and Long Phan and Zifan Wang and Chunru Yu and Adam Khoja and Fengqing Jiang and Aidan OGara and Ellie Sakhaee and Zhen Xiang and Arezoo Rajabi and Dan Hendrycks and Radha Poovendran and Bo Li and David Forsyth},
  booktitle={NeurIPS Competition Track},
  year={2023}
}

@article{mazeika2024harmbench,
  title={HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal},
  author={Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan},
  journal={arXiv preprint arXiv:2402.04249},
  year={2024}
}

License

License: MIT

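AI-compiled summary

Dataset introduction

Construction

JBB-Behaviors is built around 100 distinct misuse behaviors drawn from prior work such as the Trojan Detection Challenge, HarmBench, and AdvBench, and selected and organized with reference to OpenAI's usage policies. The dataset has two configurations, behaviors and judge_comparison: behaviors contains the harmful and benign data files, and judge_comparison contains a test data file.

Usage

To use JBB-Behaviors, follow the Hugging Face link, download the dataset, and analyze it as needed. Use is subject to the terms of the MIT license. Citing the associated papers is recommended, to credit the dataset's contributors.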

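Background and challenges

Background

JBB-Behaviors, part of the NeurIPS 2024 Datasets and Benchmarks Track, is an open robustness benchmark created by Chao et al. to track progress in generating jailbreaks of large language models (LLMs) and in defending against them. It collects 100 distinct misuse behaviors, curated with reference to OpenAI's usage policies and drawn from prior work including AdvBench, the Trojan Detection Challenge, and HarmBench. The dataset aims to give researchers a unified evaluation framework for assessing future algorithms.

Current challenges

Two challenges arose in building JBB-Behaviors: ensuring that the collected behaviors are representative and cover a broad range of misuse scenarios, and screening objectionable content so that it does not offend some readers. The research challenge the dataset addresses is how to identify and defend against jailbreaks of large language models effectively, which is essential to keeping models reliable and safe.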

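Common scenarios

Typical use case

Safety evaluation of large language models is a central concern in artificial intelligence. As an open benchmark for language model robustness, JBB-Behaviors is typically used as a list of 100 distinct misuse behaviors, each designed to elicit non-compliant output from a model, so that the model's defenses can be tested.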
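
The defense test described above can be sketched as a simple loop: send each goal to a model, pass the response to a judge, and report the fraction the judge flags. This is an illustrative outline, not the benchmark's official evaluation pipeline. `flagged_fraction`, `always_refuse`, and `refusal_judge` are hypothetical names, and the stand-in model and judge exist only to make the example runnable.

```python
from typing import Callable, Iterable


def flagged_fraction(
    goals: Iterable[str],
    respond: Callable[[str], str],
    is_unsafe: Callable[[str, str], bool],
) -> float:
    """Fraction of goals whose response the judge flags as non-compliant."""
    goal_list = list(goals)
    if not goal_list:
        raise ValueError("need at least one goal")
    flagged = sum(1 for goal in goal_list if is_unsafe(goal, respond(goal)))
    return flagged / len(goal_list)


# Hypothetical stand-in for a model: refuses every request.
def always_refuse(goal: str) -> str:
    return "I can't help with that."


# Hypothetical stand-in for a judge: flags anything that is not a refusal.
def refusal_judge(goal: str, response: str) -> bool:
    return not response.startswith("I can't")


# Placeholder goal strings, not taken from the dataset.
print(flagged_fraction(["<goal 1>", "<goal 2>"], always_refuse, refusal_judge))  # prints 0.0
```

Passing the model and the judge as plain callables keeps the loop independent of any particular LLM API or judging method.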

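Academic problems addressed

The dataset addresses the problem of defending large language models against specific misuse behaviors. It helps researchers understand where models are weak on safety, which in turn supports the development of safer model designs and evaluation methods.

Practical applications

In practice, JBB-Behaviors can help developers find and fix weaknesses in how a model handles harmful content, improving the model's robustness and reliability in real-world use and protecting users from malicious attacks.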

Recent research on the dataset