JailbreakBench/JBB-Behaviors
JBB-Behaviors Dataset Overview
Basic information
- Language: English
- Name: JBB-Behaviors
- Tags:
  - jailbreaks
  - large language models
  - harmful behaviors
  - ML safety
- License: MIT
Dataset configurations
- Config name: behaviors
  - Data files:
    - Split: harmful
      - Path: "data/harmful-behaviors.csv"
    - Split: benign
      - Path: "data/benign-behaviors.csv"
- Config name: judge_comparison
  - Data files:
    - Split: test
      - Path: "data/judge-comparison.csv"
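The configuration above maps each config name to its splits and CSV paths. A minimal Python sketch of that mapping follows. The `DATA_FILES` dict and the `resolve_path` and `load_split` helpers are hypothetical names introduced for illustration. The `load_dataset` call assumes the dataset is hosted on the Hugging Face Hub under the ID in the title and that the `datasets` library is installed; it needs network access and is not executed here.

```python
# Config -> split -> CSV path, transcribed from the configuration listed above.
DATA_FILES = {
    "behaviors": {
        "harmful": "data/harmful-behaviors.csv",
        "benign": "data/benign-behaviors.csv",
    },
    "judge_comparison": {
        "test": "data/judge-comparison.csv",
    },
}


def resolve_path(config: str, split: str) -> str:
    """Return the CSV path for a config/split pair (hypothetical helper)."""
    try:
        return DATA_FILES[config][split]
    except KeyError:
        raise KeyError(f"unknown config/split: {config!r}/{split!r}") from None


def load_split(config: str, split: str):
    """Load one split through the Hugging Face `datasets` library.

    Assumption: the dataset is published on the Hub as
    "JailbreakBench/JBB-Behaviors". Requires network access.
    """
    from datasets import load_dataset  # imported lazily so the mapping works offline

    resolve_path(config, split)  # fail early on an unknown config or split
    return load_dataset("JailbreakBench/JBB-Behaviors", config, split=split)
```

For example, `resolve_path("behaviors", "benign")` returns `"data/benign-behaviors.csv"`, while an unknown pair such as `("behaviors", "test")` raises `KeyError`.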
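Dataset description
- Purpose: Evaluating jailbreak attacks on large language models (LLMs) and defenses against them.
- Contents: 100 distinct misuse behaviors, drawn from the following sources:
  - Original
  - Trojan Detection Challenge/HarmBench
  - AdvBench
- Data structure:
  - Behavior: A unique identifier describing a distinct misuse behavior
  - Goal: A query requesting an objectionable behavior
  - Target: An affirmative response to the goal string
  - Category: A broader category of misuse from OpenAI's usage policies
  - Source: The source the behavior was taken from (Original, Trojan Detection Challenge 2023 Red Teaming Track/HarmBench, or AdvBench)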
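The five fields listed above can be modeled as a small record type. The sketch below is illustrative only: it assumes the CSV column headers match the field names exactly (the card does not confirm this), `BehaviorRecord` and `read_behaviors` are hypothetical names, and the sample row holds placeholders rather than real dataset content.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorRecord:
    behavior: str  # unique identifier of the misuse behavior
    goal: str      # query requesting the behavior
    target: str    # affirmative response to the goal string
    category: str  # broader misuse category
    source: str    # where the behavior was taken from


def read_behaviors(csv_file) -> list[BehaviorRecord]:
    """Parse rows from an open CSV file object into BehaviorRecord instances.

    Assumption: the header row uses the column names Behavior, Goal,
    Target, Category, and Source.
    """
    reader = csv.DictReader(csv_file)
    return [
        BehaviorRecord(
            behavior=row["Behavior"],
            goal=row["Goal"],
            target=row["Target"],
            category=row["Category"],
            source=row["Source"],
        )
        for row in reader
    ]


# Placeholder row for illustration; not taken from the dataset.
SAMPLE_CSV = (
    "Behavior,Goal,Target,Category,Source\n"
    "<behavior-id>,<goal text>,<target text>,<category>,Original\n"
)

records = read_behaviors(io.StringIO(SAMPLE_CSV))
print(records[0].source)  # prints "Original"
```

A frozen dataclass is used so that records cannot be modified by accident during an evaluation run.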
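Dataset usage
- Note: Some of the contents of the dataset may be offensive to some readers.
- Composition: The dataset contains 100 distinct misuse behaviors, divided into ten broad categories corresponding to OpenAI's usage policies.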
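The stated composition (100 behaviors across ten categories) can serve as a sanity check after loading. The following is a minimal sketch; `check_composition` is a hypothetical helper, and the labels it runs on are synthetic placeholders, not the dataset's actual category names.

```python
from collections import Counter

EXPECTED_BEHAVIORS = 100   # total behaviors stated in the card
EXPECTED_CATEGORIES = 10   # number of broad categories stated in the card


def check_composition(categories) -> Counter:
    """Count behaviors per category and verify the stated totals.

    `categories` is an iterable with one category label per behavior.
    Raises ValueError if the totals differ from the card's description.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    if total != EXPECTED_BEHAVIORS:
        raise ValueError(f"expected {EXPECTED_BEHAVIORS} behaviors, got {total}")
    if len(counts) != EXPECTED_CATEGORIES:
        raise ValueError(f"expected {EXPECTED_CATEGORIES} categories, got {len(counts)}")
    return counts


# Synthetic labels for illustration. The even spread across categories is an
# artifact of this placeholder, not a claim about the dataset.
placeholder_labels = [f"category-{i % 10}" for i in range(100)]
counts = check_composition(placeholder_labels)
print(len(counts))  # prints 10
```

With real data, the labels would come from the Category field of each loaded row.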
Citation
- Dataset citation (BibTeX):
  @misc{chao2024jailbreakbench,
    title={JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models},
    author={Patrick Chao and Edoardo Debenedetti and Alexander Robey and Maksym Andriushchenko and Francesco Croce and Vikash Sehwag and Edgar Dobriban and Nicolas Flammarion and George J. Pappas and Florian Tramèr and Hamed Hassani and Eric Wong},
    year={2024},
    eprint={2404.01318},
    archivePrefix={arXiv},
    primaryClass={cs.CR}
  }
- Related dataset citations (BibTeX):
  @misc{zou2023universal,
    title={Universal and Transferable Adversarial Attacks on Aligned Language Models},
    author={Andy Zou and Zifan Wang and J. Zico Kolter and Matt Fredrikson},
    year={2023},
    eprint={2307.15043},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
  }
  @inproceedings{tdc2023,
    title={TDC 2023 (LLM Edition): The Trojan Detection Challenge},
    author={Mantas Mazeika and Andy Zou and Norman Mu and Long Phan and Zifan Wang and Chunru Yu and Adam Khoja and Fengqing Jiang and Aidan OGara and Ellie Sakhaee and Zhen Xiang and Arezoo Rajabi and Dan Hendrycks and Radha Poovendran and Bo Li and David Forsyth},
    booktitle={NeurIPS Competition Track},
    year={2023}
  }
  @article{mazeika2024harmbench,
    title={HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal},
    author={Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan},
    journal={arXiv preprint arXiv:2402.04249},
    year={2024}
  }
License
- License: MIT




