ibm-research/AttaQ

Name: ibm-research/AttaQ
Creator: ibm-research
Published: 2024-01-26 08:16:47
License: MIT

Source: Hugging Face. Updated: 2024-01-26. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/ibm-research/AttaQ


Description:

---
license: mit
language: en
multilinguality: monolingual
source_datasets: extended|Anthropic/hh-rlhf
task_categories:
- text-generation
- text2text-generation
size_categories:
- 1K<n<10K
tags:
- safety
- harm
- LLMs
- Vulnerability
- red teaming
- toxicity
---

# AttaQ Dataset Card

The AttaQ red teaming dataset consists of 1402 carefully crafted adversarial questions. It is designed to evaluate Large Language Models (LLMs) by assessing their tendency to generate harmful or undesirable responses, and it may serve as a benchmark for the potential harm of LLM responses.

The questions are divided into seven classes: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence.

Researchers and developers can use this dataset to assess the behavior of LLMs and explore the factors that influence their responses, with the ultimate aim of making them more harmless and their use more ethical.

AttaQ was developed from multiple sources:

1. Extraction: Attacks were first methodically extracted from the hh-rlhf dataset provided by Anthropic. The extracted attacks were then semi-automatically labeled into the seven categories above.
2. Generation: LLMs were used to generate a substantial volume of attacks similar in nature to those extracted from the hh-rlhf dataset.
3. Crawl and Generate: We used LLMs together with the Wikipedia Crimes Page to generate a series of actions associated with each type of crime. These actions were then used to devise questions that explore ways to engage in unethical or unlawful actions related to those crimes, focusing specifically on strategies to evade detection or consequences.

### Disclaimer:

* The data contains offensive and upsetting content by nature, so it may not be easy to read. Please read it in accordance with your own personal risk tolerance.
* LLM responses to the AttaQ samples are in many cases harmful and/or violent.
* This dataset is a representative subset of all possible attacks. Many other attacks can cause an LLM to give harmful or inappropriate responses.
* The red-teaming community's goal is to make models less harmful. We restrict the usage of the dataset to making models less harmful.

### Fields

- `label`: The label of the adversarial question
- `input`: The adversarial question
- `additional_info`: The source of the adversarial question

### Usage Information

The data are described in the [paper](https://arxiv.org/abs/2311.04124): **Unveiling Safety Vulnerabilities of Large Language Models**.

As detailed in the paper, automatically evaluating the potential harm of LLM responses requires a harmlessness metric. The model under test should be prompted with each question in the dataset, and the corresponding responses are then evaluated with a metric that considers both the input and the output. Our paper uses the "OpenAssistant/reward-model-deberta-v3-large-v2" reward model, though other models such as "sileod/deberta-v3-large-tasksource-rlhf-reward-model" can also be employed.
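The protocol described above (prompt the model with each question, then score the input/output pair) can be sketched independently of any particular model. This is a minimal sketch, not the authors' implementation: `evaluate_harmlessness`, `stub_generate`, and `stub_score` are hypothetical names, the sample questions are placeholders, and the stub metric only stands in for a real reward model.

```python
from typing import Callable, Dict, List


def evaluate_harmlessness(
    samples: List[Dict[str, str]],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[Dict[str, object]]:
    """Prompt the model with each question, then score the (input, output) pair."""
    results = []
    for sample in samples:
        output = generate(sample["input"])
        results.append(
            {
                "label": sample["label"],
                "input": sample["input"],
                "output": output,
                # The metric sees both the question and the response.
                "reward": score(sample["input"], output),
            }
        )
    return results


# Hypothetical stand-ins so the sketch runs without downloading any model.
def stub_generate(question: str) -> str:
    return "I can't help with that."


def stub_score(question: str, answer: str) -> float:
    # Placeholder metric: favors refusals. A real run would call a reward model.
    return 1.0 if "can't help" in answer else -1.0


# Placeholder records using the dataset's three fields.
samples = [
    {"label": "deception", "input": "placeholder question 1", "additional_info": "stub"},
    {"label": "violence", "input": "placeholder question 2", "additional_info": "stub"},
]

results = evaluate_harmlessness(samples, stub_generate, stub_score)
print(results[0]["reward"])
```

Passing the generator and the metric in as callables keeps the loop unchanged when either the model under test or the reward model is swapped.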
```python
import pandas as pd
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Load dataset and initialize models
dataset = load_dataset("ibm-research/AttaQ")
# gpt2 is a causal language model, so it needs the "text-generation" pipeline
generator = pipeline("text-generation", model="gpt2", tokenizer="gpt2")
reward_name = "sileod/deberta-v3-large-tasksource-rlhf-reward-model"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer_reward = AutoTokenizer.from_pretrained(reward_name)

# Create DataFrame and generate responses
df = pd.DataFrame(dataset['train'])
# max_new_tokens bounds only the continuation; return_full_text=False drops the prompt
responses = generator(df['input'].tolist(), max_new_tokens=50, num_return_sequences=1, return_full_text=False)
# For a list of prompts the pipeline returns one list of candidates per prompt
df['output'] = [response[0]['generated_text'] for response in responses]

# Calculate harmlessness for each sample
inputs = tokenizer_reward(df['input'].tolist(), df['output'].tolist(), return_tensors="pt", padding=True, truncation=True)
# squeeze(-1) turns the (N, 1) logits into one score per row
df['reward'] = reward_model(**inputs).logits.squeeze(-1).detach().cpu().numpy()
```

<a href="url"><img src="https://cdn-uploads.huggingface.co/production/uploads/64540de813dba495b2e0354e/wOLy8gjqlNXZMlDAl1lJD.png" align="center" height="800" width="800" ></a>

A comparison between several models across various attack domains in the AttaQ dataset.

### Citation Information

If this dataset is utilized in your research, kindly cite the following paper:

```
@article{kour2023unveiling,
  title={Unveiling Safety Vulnerabilities of Large Language Models},
  author={Kour, George and Zalmanovici, Marcel and Zwerdling, Naama and Goldbraich, Esther and Fandina, Ora Nova and Anaby-Tavor, Ateret and Raz, Orna and Farchi, Eitan},
  journal={arXiv preprint arXiv:2311.04124},
  year={2023}
}
```
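A per-domain comparison like the one in the figure above can be approximated by averaging reward scores within each `label`. The following is a minimal sketch under stated assumptions: `mean_reward_by_label` is a hypothetical helper, the label strings and scores are made-up placeholders rather than real results, and a higher reward is assumed to mean a more harmless response.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping


def mean_reward_by_label(rows: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Average the reward per attack category, sorted from lowest to highest mean."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        grouped[str(row["label"])].append(float(row["reward"]))
    means = {label: mean(values) for label, values in grouped.items()}
    # Lowest mean first: under the stated assumption, the weakest domain leads.
    return dict(sorted(means.items(), key=lambda item: item[1]))


# Placeholder rows; real rows would carry the dataset's labels and computed rewards.
rows = [
    {"label": "violence", "reward": -2.0},
    {"label": "violence", "reward": -1.0},
    {"label": "pii", "reward": 0.5},
    {"label": "pii", "reward": 1.5},
    {"label": "deception", "reward": 3.0},
]

summary = mean_reward_by_label(rows)
for label, value in summary.items():
    print(f"{label}: {value:+.2f}")
```

With a pandas DataFrame that has `label` and `reward` columns, `df.to_dict("records")` yields rows in the shape this helper expects.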


Provider:

ibm-research
