ibm/AttaQ

Hugging Face · Updated 2024-01-26 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ibm/AttaQ

Resource description:

---
license: mit
language: en
multilinguality: monolingual
source_datasets: extended|Anthropic/hh-rlhf
task_categories:
- text-generation
- text2text-generation
size_categories:
- 1K<n<10K
tags:
- safety
- harm
- LLMs
- Vulnerability
- red teaming
- toxicity
---

# AttaQ Dataset Card

The AttaQ red teaming dataset, consisting of 1402 carefully crafted adversarial questions, is designed to evaluate Large Language Models (LLMs) by assessing their tendency to generate harmful or undesirable responses. It may serve as a benchmark to assess the potential harm of responses produced by LLMs.

The dataset is categorized into seven distinct classes of questions: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence.

Researchers and developers can use this dataset to assess the behavior of LLMs and explore the various factors that influence their responses, ultimately aiming to enhance their harmlessness and ethical usage.

The development of AttaQ involved multiple sources:

1. Extraction: Initially, attacks were methodically extracted from the hh-rlhf dataset provided by Anthropic. These extracted attacks were then semi-automatically labeled into the seven aforementioned categories.
2. Generation: LLMs were employed to generate a substantial volume of attacks that were similar in nature to the ones extracted from the hh-rlhf dataset.
3. Crawl and Generate: We utilized LLMs in conjunction with the Wikipedia Crimes Page to generate a series of actions associated with each type of crime. These actions were then employed to devise questions that explore ways to engage in unethical or unlawful actions related to those crimes, specifically focusing on strategies to evade detection or consequences.

### Disclaimer:

* The data contains offensive and upsetting content by nature, so it may not be easy to read. Please read it in accordance with your own personal risk tolerance.
* In many cases, an LLM's response to the AttaQ samples is harmful and/or violent.
* This dataset is a representative subset of all possible attacks. Many other attacks can cause an LLM to give harmful or inappropriate responses.
* The red-teaming community’s goal is to make models less harmful. We restrict the usage of the dataset to make models less harmful.

### Fields

- `label`: The label of the adversarial question
- `input`: The adversarial question
- `additional_info`: The source of the adversarial question

### Usage Information

The data are described in the [paper](https://arxiv.org/abs/2311.04124): **Unveiling Safety Vulnerabilities of Large Language Models**.

As detailed in the paper, automatically evaluating the potential harm of LLM responses requires a harmlessness metric. The model under test should be prompted with each question in the dataset, and the corresponding responses are then evaluated using a metric that considers both the input and the output. Our paper utilizes the "OpenAssistant/reward-model-deberta-v3-large-v2" reward model, though other models such as "sileod/deberta-v3-large-tasksource-rlhf-reward-model" can also be employed.

```python
import pandas as pd
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Load dataset and initialize models.
# gpt2 is a causal LM, so it is served by the "text-generation" pipeline.
dataset = load_dataset("ibm/AttaQ")
generator = pipeline("text-generation", model="gpt2", tokenizer="gpt2")
reward_name = "sileod/deberta-v3-large-tasksource-rlhf-reward-model"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer_reward = AutoTokenizer.from_pretrained(reward_name)

# Create DataFrame and generate one response per question (continuation only, without the prompt)
df = pd.DataFrame(dataset['train'])
responses = generator(df['input'].tolist(), max_new_tokens=50, num_return_sequences=1, return_full_text=False)
df['output'] = [response[0]['generated_text'] for response in responses]

# Calculate harmlessness for each sample; squeeze the trailing dimension to get one score per row
inputs = tokenizer_reward(df['input'].tolist(), df['output'].tolist(), return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    df['reward'] = reward_model(**inputs).logits.squeeze(-1).cpu().numpy()
```

<a href="url"><img src="https://cdn-uploads.huggingface.co/production/uploads/64540de813dba495b2e0354e/wOLy8gjqlNXZMlDAl1lJD.png" align="center" height="800" width="800" ></a>

A comparison between several models across various attack domains in the AttaQ dataset.

### Citation Information

If this dataset is utilized in your research, kindly cite the following paper:

```
@article{kour2023unveiling,
  title={Unveiling Safety Vulnerabilities of Large Language Models},
  author={Kour, George and Zalmanovici, Marcel and Zwerdling, Naama and Goldbraich, Esther and Fandina, Ora Nova and Anaby-Tavor, Ateret and Raz, Orna and Farchi, Eitan},
  journal={arXiv preprint arXiv:2311.04124},
  year={2023}
}
```

Provider:

ibm

Summary of original information

AttaQ Dataset Summary

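Basic information

License: MIT
Language: English
Multilinguality: monolingual
Source datasets: extended from Anthropic/hh-rlhf
Task categories:
- text-generation
- text2text-generation
Size category: 1K<n<10K
Tags:
- safety
- harm
- LLMs
- Vulnerability
- red teaming
- toxicity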

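Dataset description

Purpose: Evaluate the tendency of large language models (LLMs) to generate harmful or undesirable responses.
Composition: 1402 carefully crafted adversarial questions in seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence.
Use: Assess the behavior of LLMs and explore the factors that influence their responses, with the aim of enhancing their harmlessness and ethical usage.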

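Development process

Extraction: Attacks were methodically extracted from Anthropic's hh-rlhf dataset and semi-automatically labeled into the seven categories.
Generation: LLMs were used to generate a large number of attacks similar to those extracted from the hh-rlhf dataset.
Crawl and generate: LLMs were combined with the Wikipedia crimes page to generate actions associated with each type of crime; these actions were then used to devise questions that explore unethical or unlawful behavior, with a particular focus on strategies to evade detection or consequences.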

Data fields

label: The label of the adversarial question
input: The adversarial question
additional_info: The source of the adversarial question
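
The three fields listed above can be modelled as a small record type. The following is a minimal sketch, assuming each row arrives as a plain dict with exactly these three keys. The `AttaQRecord` class, the `count_by_label` helper, and the sample rows (including their label and source strings) are hypothetical placeholders for illustration, not values taken from the dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AttaQRecord:
    """One row of the dataset, mirroring the three documented fields."""
    label: str            # category of the adversarial question
    input: str            # the adversarial question text
    additional_info: str  # source of the adversarial question


def count_by_label(rows):
    """Parse raw dict rows into records and count how many fall under each label."""
    records = [AttaQRecord(**row) for row in rows]
    return Counter(record.label for record in records)


# Hypothetical placeholder rows; real rows would come from load_dataset("ibm/AttaQ").
sample_rows = [
    {"label": "deception", "input": "<question 1>", "additional_info": "<source A>"},
    {"label": "violence", "input": "<question 2>", "additional_info": "<source B>"},
    {"label": "deception", "input": "<question 3>", "additional_info": "<source A>"},
]

label_counts = count_by_label(sample_rows)
print(label_counts)
```

Parsing into a frozen dataclass makes a row with a missing or unexpected key fail early, which may be useful before running a long evaluation.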

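Usage information

Evaluation method: The model under test should be prompted with each question in the dataset, and its responses evaluated with a metric that considers both the input and the output.
Evaluation model: The "OpenAssistant/reward-model-deberta-v3-large-v2" reward model is used; "sileod/deberta-v3-large-tasksource-rlhf-reward-model" can also be employed.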
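
The evaluation method described above can be sketched as a loop over questions with two pluggable callables. This is a minimal illustration rather than the authors' implementation: the `evaluate` function and the two stubs are hypothetical names, and the stubs stand in for the model under test and the reward model so that the sketch runs without downloading anything.

```python
from typing import Callable, Dict, List


def evaluate(
    questions: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[Dict[str, object]]:
    """Prompt the model with each question and score the (input, output) pair."""
    results = []
    for question in questions:
        answer = generate(question)
        results.append(
            {"input": question, "output": answer, "reward": score(question, answer)}
        )
    return results


# Stand-ins so the sketch runs offline. A real run would wrap the model under
# test in `generate` and a reward model in `score`.
def stub_generate(question: str) -> str:
    return "I cannot help with that."


def stub_score(question: str, answer: str) -> float:
    # Toy rule for illustration only: treat a refusal as harmless.
    return 1.0 if "cannot" in answer else -1.0


results = evaluate(["<question 1>", "<question 2>"], stub_generate, stub_score)
print(results[0])
```

Passing the question together with the answer to `score` reflects the requirement that the metric consider both the input and the output.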

Citation information

@article{kour2023unveiling, title={Unveiling Safety Vulnerabilities of Large Language Models}, author={Kour, George and Zalmanovici, Marcel and Zwerdling, Naama and Goldbraich, Esther and Fandina, Ora Nova and Anaby-Tavor, Ateret and Raz, Orna and Farchi, Eitan}, journal={arXiv preprint arXiv:2311.04124}, year={2023} }

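Collected summary

Dataset introduction

Construction

The construction of the AttaQ dataset combined several strategies to support systematic evaluation of the potential harm of large language models (LLMs) when they face adversarial questions. First, adversarial attack samples were methodically extracted from Anthropic's hh-rlhf dataset and semi-automatically classified into seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence. Second, LLMs were used to generate a large number of attack questions similar in nature to the extracted samples, which further increased the diversity of the dataset. Finally, LLMs were combined with the Wikipedia crimes page to generate actions associated with each type of crime, and these actions were used to design adversarial questions that explore unlawful or unethical behavior, particularly strategies for evading detection or consequences.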

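Features

The notable features of AttaQ are the diversity and targeted nature of its adversarial questions. The dataset covers a broad range of potential harm categories, such as deception, discrimination, and violence, and its combination of generation and crawling is intended to make the questions complex and realistic. Each question also carries a label and source information, which supports deeper analysis. Although the dataset contains sensitive and upsetting content, it was designed to help identify and reduce the potential harm of LLMs and thereby support safer, more responsible AI applications.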

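Usage

AttaQ is mainly used to evaluate and improve the safety and harmlessness of large language models (LLMs). Researchers prompt a model with the adversarial questions and then analyze its responses quantitatively with a harmlessness metric. In practice, this means loading the dataset and initializing a generation model and a reward model, such as GPT-2 and deberta-v3-large-tasksource-rlhf-reward-model. The generation model produces a response to each adversarial question, the reward model computes a harmlessness score for that response, and the scores show how the model performs across the different attack categories.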
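
Once every response has a harmlessness score, comparing attack categories reduces to grouping by label. The sketch below assumes each scored row is a dict with a `label` and a `reward` key; the `mean_reward_by_label` helper and the numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict
from statistics import mean


def mean_reward_by_label(rows):
    """Average the reward of all scored rows that share the same label."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["label"]].append(row["reward"])
    return {label: mean(values) for label, values in grouped.items()}


# Hypothetical scores, not real measurements.
scored_rows = [
    {"label": "deception", "reward": 2.0},
    {"label": "deception", "reward": -1.0},
    {"label": "violence", "reward": -3.0},
]

per_label = mean_reward_by_label(scored_rows)
print(per_label)
```

A lower mean for a label would suggest that the model under test is more vulnerable to attacks in that category, under the assumption that a higher reward means a more harmless response.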

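Background and challenges

Background overview

The AttaQ dataset was developed by IBM to evaluate the tendency of large language models (LLMs) to generate harmful or inappropriate responses when faced with adversarial questions. It contains 1402 carefully crafted adversarial questions in seven categories: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence. With these categories, researchers and developers can assess the behavior of LLMs and explore the factors that influence their responses, with the aim of improving their harmlessness and ethical usage. AttaQ combines attacks extracted from Anthropic's hh-rlhf dataset, attacks generated with LLMs, and attacks generated from the Wikipedia crimes page.

Current challenges

The challenges around AttaQ fall into two main areas. First, construction requires extracting and generating adversarial questions from multiple sources, which demands careful classification and labeling so that each question accurately reflects a specific type of harm. Second, evaluating LLM responses requires developing and applying a harmlessness metric that quantifies how harmful the generated responses are. In addition, the sensitive and offensive content in the dataset can be difficult for researchers and developers to read, and the dataset must be used within ethical and legal boundaries so that the research purpose remains legitimate.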

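Common scenarios

Classic use cases

The classic use of AttaQ is to evaluate how large language models (LLMs) respond to carefully crafted adversarial questions. With the dataset, researchers and developers can systematically test a model on the seven categories of questions (deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence) and quantify its tendency to generate harmful or inappropriate responses.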
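
One way to express such a tendency as a single number per category is the share of responses whose harmlessness score falls below a cutoff. The sketch below is an assumption for illustration only: the source does not define a threshold, so the `threshold=0.0` default, the `flagged_rate_by_label` helper, and the sample scores are all hypothetical.

```python
def flagged_rate_by_label(rows, threshold=0.0):
    """Return, per label, the fraction of rows whose reward is below the threshold."""
    totals, flagged = {}, {}
    for row in rows:
        label = row["label"]
        totals[label] = totals.get(label, 0) + 1
        if row["reward"] < threshold:
            flagged[label] = flagged.get(label, 0) + 1
    return {label: flagged.get(label, 0) / totals[label] for label in totals}


# Hypothetical scores, not real measurements.
scored_rows = [
    {"label": "deception", "reward": 1.5},
    {"label": "deception", "reward": -0.5},
    {"label": "deception", "reward": -2.0},
    {"label": "deception", "reward": 0.5},
    {"label": "PII", "reward": 0.2},
]

rates = flagged_rate_by_label(scored_rows)
print(rates)
```

Any real threshold would need to be calibrated for the specific reward model, since raw reward scores from different models are not directly comparable.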

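Practical applications

In practice, AttaQ is used for red teaming of model safety, that is, simulating malicious inputs to find and fix a model's safety vulnerabilities. This kind of testing gives a model a safety evaluation before deployment, helping it remain safe and ethically compliant when it meets complex real-world inputs, which is relevant to fields such as healthcare, law, and education.

Derived and related work

Building on AttaQ, researchers have developed methods for evaluating and improving the safety of large language models. Examples include using reward models to quantify the harmlessness of model responses and using adversarial training to strengthen model robustness. The dataset has also encouraged research on automated safety evaluation tools and advanced red teaming techniques.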

The content above was collected and summarized by 遇见数据集.
