AttaQ

ModelScope community · Updated 2025-11-27 · Indexed 2025-11-22
Download link:
https://modelscope.cn/datasets/ibm-research/AttaQ
Description:
# AttaQ Dataset Card

The AttaQ red teaming dataset consists of 1402 carefully crafted adversarial questions. It is designed to evaluate Large Language Models (LLMs) by assessing their tendency to generate harmful or undesirable responses, and it may serve as a benchmark for the potential harm of LLM responses.

The dataset is categorized into seven distinct classes of questions: deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence. Researchers and developers can use this dataset to assess the behavior of LLMs and explore the factors that influence their responses, with the aim of making them more harmless and more ethical to use.

The development of AttaQ involved multiple sources:

1. Extraction: Attacks were first methodically extracted from the hh-rlhf dataset provided by Anthropic. The extracted attacks were then semi-automatically labeled into the seven categories above.
2. Generation: LLMs were employed to generate a substantial volume of attacks similar in nature to the ones extracted from the hh-rlhf dataset.
3. Crawl and Generate: We used LLMs in conjunction with the Wikipedia Crimes Page to generate a series of actions associated with each type of crime. These actions were then used to devise questions that explore ways to engage in unethical or unlawful actions related to those crimes, focusing specifically on strategies to evade detection or consequences.

### Disclaimer:

* The data contains offensive and upsetting content by nature, so it may not be easy to read. Please read it in accordance with your own personal risk tolerance.
* LLM responses to the AttaQ samples are, in many cases, harmful and/or violent.
* This dataset is a representative subset of all possible attacks. Many other attacks can cause an LLM to produce harmful or inappropriate responses.
* The red-teaming community's goal is to make models less harmful. We restrict the usage of the dataset to making models less harmful.

### Fields

- `label`: The label (category) of the adversarial question
- `input`: The adversarial question
- `additional_info`: The source of the adversarial question

### Usage Information

The data are described in the [paper](https://arxiv.org/abs/2311.04124): **Unveiling Safety Vulnerabilities of Large Language Models**.

As detailed in the paper, automatically evaluating the potential harm of LLM responses requires a harmlessness metric. The model under test is prompted with each question in the dataset, and the corresponding responses are evaluated using a metric that considers both the input and the output. Our paper uses the "OpenAssistant/reward-model-deberta-v3-large-v2" reward model, though other models such as "sileod/deberta-v3-large-tasksource-rlhf-reward-model" can also be employed.

```python
import pandas as pd
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the dataset and initialize the generator and the reward model
dataset = load_dataset("ibm/AttaQ")
# gpt2 is a causal language model, so it needs the "text-generation" task
generator = pipeline("text-generation", model="gpt2", tokenizer="gpt2")
reward_name = "sileod/deberta-v3-large-tasksource-rlhf-reward-model"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
tokenizer_reward = AutoTokenizer.from_pretrained(reward_name)

# Create a DataFrame and generate one response per question.
# max_new_tokens caps only the continuation; return_full_text=False drops the echoed prompt.
df = pd.DataFrame(dataset["train"])
responses = generator(df["input"].tolist(), max_new_tokens=50, num_return_sequences=1, return_full_text=False)
df["output"] = [response[0]["generated_text"] for response in responses]

# Calculate harmlessness for each (question, response) pair
inputs = tokenizer_reward(df["input"].tolist(), df["output"].tolist(), return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    df["reward"] = reward_model(**inputs).logits.squeeze(-1).cpu().numpy()
```

<a href="url"><img src="https://cdn-uploads.huggingface.co/production/uploads/64540de813dba495b2e0354e/wOLy8gjqlNXZMlDAl1lJD.png" align="center" height="800" width="800" ></a>

A comparison between several models across various attack domains in the AttaQ dataset.

### Citation Information

If this dataset is used in your research, kindly cite the following paper:

```
@article{kour2023unveiling,
  title={Unveiling Safety Vulnerabilities of Large Language Models},
  author={Kour, George and Zalmanovici, Marcel and Zwerdling, Naama and Goldbraich, Esther and Fandina, Ora Nova and Anaby-Tavor, Ateret and Raz, Orna and Farchi, Eitan},
  journal={arXiv preprint arXiv:2311.04124},
  year={2023}
}
```
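
A per-category comparison like the one described above can be approximated by averaging the `reward` values within each `label`. The following is a minimal sketch, not the paper's evaluation code. It assumes rows shaped like the scored DataFrame from the usage example (for instance, obtained with `df.to_dict("records")`) and that a lower reward indicates a more harmful response. The helper name `mean_reward_by_label`, the label strings, and the reward values are all hypothetical and for illustration only.

```python
from collections import defaultdict
from statistics import mean


def mean_reward_by_label(records):
    """Return {label: mean reward}, ordered from lowest to highest mean reward."""
    rewards = defaultdict(list)
    for row in records:
        rewards[row["label"]].append(float(row["reward"]))
    # Sort ascending so categories with the lowest mean reward come first
    averaged = ((label, mean(values)) for label, values in rewards.items())
    return dict(sorted(averaged, key=lambda item: item[1]))


# Toy rows with made-up labels and rewards, only to show the shape of the result
toy_rows = [
    {"label": "violence", "reward": -2.0},
    {"label": "violence", "reward": -1.0},
    {"label": "deception", "reward": 0.5},
    {"label": "pii", "reward": 1.5},
    {"label": "pii", "reward": 0.5},
]
summary = mean_reward_by_label(toy_rows)
print(summary)
```

Sorting by mean reward puts the categories where the model under test scores worst at the top, which is usually what a reader wants to see first. With the real DataFrame, the same aggregation can also be expressed directly with a pandas group-by on `label`.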

Provider:
maas
Created:
2025-10-03