LibrAI/do-not-answer

Name: LibrAI/do-not-answer
Creator: LibrAI
Published: 2023-08-28 18:22:52
License: no description provided

Source: Hugging Face. Updated: 2023-08-28. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/LibrAI/do-not-answer

Summary:

Do-Not-Answer is an open-source dataset for low-cost evaluation of the safety mechanisms of Large Language Models (LLMs). The dataset has been curated and filtered to include only prompts that a responsible language model should not answer. Beyond human annotation, Do-Not-Answer also provides model-based evaluation, in which a fine-tuned BERT-like evaluator with 600M parameters achieves results comparable to human and GPT-4 evaluation. The dataset contains 939 instructions, organized into a three-level hierarchical taxonomy covering 61 specific harms. The responses of six models (GPT-4, ChatGPT, Claude, LLaMA-2, Vicuna, ChatGLM2) were evaluated at two levels: harmfulness and action category.

Provider:

LibrAI

Summary of original information

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

Overview

Do-Not-Answer is an open-source dataset designed to evaluate the safety mechanisms of large language models (LLMs). The dataset includes 939 prompts that responsible language models should not answer. It incorporates both human annotations and model-based evaluation using a fine-tuned BERT-like evaluator with 600M parameters.

Dataset Features

Features:
- id: int64
- risk_area: string
- types_of_harm: string
- specific_harms: string
- question: string
- GPT4_response: string
- GPT4_harmful: int64
- GPT4_action: int64
- ChatGPT_response: string
- ChatGPT_harmful: int64
- ChatGPT_action: int64
- Claude_response: string
- Claude_harmful: int64
- Claude_action: int64
- ChatGLM2_response: string
- ChatGLM2_harmful: int64
- ChatGLM2_action: int64
- llama2-7b-chat_response: string
- llama2-7b-chat_harmful: int64
- llama2-7b-chat_action: int64
- vicuna-7b_response: string
- vicuna-7b_harmful: int64
- vicuna-7b_action: int64
Splits:
- train: 939 examples, 3997193 bytes
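
The feature list above follows a regular pattern: five shared columns plus a `_response`, `_harmful`, and `_action` column for each of six model prefixes. The sketch below rebuilds that schema and checks a row against it. It is only an illustration: `expected_schema` and `validate_row` are hypothetical helper names, and rows are assumed to be plain Python dicts (for instance, as obtained by iterating over the train split with the Hugging Face `datasets` library, which this sketch does not import).

```python
# Model prefixes exactly as they appear in the feature list above.
MODEL_PREFIXES = [
    "GPT4", "ChatGPT", "Claude", "ChatGLM2", "llama2-7b-chat", "vicuna-7b",
]

# Columns shared by every row, with the Python type matching the listed dtype.
BASE_COLUMNS = {
    "id": int,
    "risk_area": str,
    "types_of_harm": str,
    "specific_harms": str,
    "question": str,
}

# Columns repeated once per model prefix.
PER_MODEL_SUFFIXES = {"response": str, "harmful": int, "action": int}


def expected_schema():
    """Return a dict mapping every expected column name to its Python type."""
    schema = dict(BASE_COLUMNS)
    for prefix in MODEL_PREFIXES:
        for suffix, typ in PER_MODEL_SUFFIXES.items():
            schema[f"{prefix}_{suffix}"] = typ
    return schema


def validate_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []
    for name, typ in expected_schema().items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], typ):
            problems.append(
                f"{name}: expected {typ.__name__}, got {type(row[name]).__name__}"
            )
    return problems
```

A check like this can catch prefix mismatches early, since the column prefixes (`GPT4`, `llama2-7b-chat`, `vicuna-7b`) differ from the model names used in the prose (GPT-4, LLaMA-2, Vicuna).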

Response Assessment

Models Evaluated: GPT-4, ChatGPT, Claude, LLaMA-2, Vicuna, ChatGLM2
Assessment Levels:
- Harmfulness: Binary classification of whether a response is harmful.
- Action Category: Responses categorized into six distinct groups.
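
The two assessment levels map onto the `<prefix>_harmful` and `<prefix>_action` columns, so a per-model summary can be sketched as below. This is an illustration under stated assumptions: `summarize_model` is a hypothetical helper, `harmful` is assumed to be coded 1 for harmful and 0 for harmless, action codes are treated as opaque integers, and the toy rows are made up rather than taken from the dataset.

```python
from collections import Counter


def summarize_model(rows, prefix):
    """Count harmful responses and tally action categories for one model."""
    if not rows:
        raise ValueError("rows must not be empty")
    # Assumption: 1 means the response was labelled harmful, 0 harmless.
    harmful = sum(row[f"{prefix}_harmful"] for row in rows)
    # Action codes are counted as-is; their meaning is not interpreted here.
    actions = Counter(row[f"{prefix}_action"] for row in rows)
    return {
        "n": len(rows),
        "harmful": harmful,
        "harmful_rate": harmful / len(rows),
        "actions": dict(sorted(actions.items())),
    }


# Made-up rows for illustration only; not real dataset values.
toy_rows = [
    {"Claude_harmful": 0, "Claude_action": 0},
    {"Claude_harmful": 0, "Claude_action": 0},
    {"Claude_harmful": 1, "Claude_action": 5},
    {"Claude_harmful": 0, "Claude_action": 2},
]
summary = summarize_model(toy_rows, "Claude")
```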

Human Evaluation

Safety Rankings: LLaMA-2 ranks first in harmlessness, followed by ChatGPT, Claude, GPT-4, and Vicuna, with ChatGLM2 ranking last.
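
A ranking of this kind could be reproduced from the `<prefix>_harmful` columns. The sketch below assumes the ranking is based on how many responses were labelled harmful (fewer is safer), which the text above does not spell out. `rank_by_harmlessness` is a hypothetical helper, and the model prefixes and rows are placeholders, not real results.

```python
def rank_by_harmlessness(rows, prefixes):
    """Return (prefix, harmful_count) pairs, safest model first."""
    counts = {
        prefix: sum(row[f"{prefix}_harmful"] for row in rows)
        for prefix in prefixes
    }
    # Sort by harmful count, breaking ties alphabetically for determinism.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Placeholder data with hypothetical model prefixes; not real results.
toy_rows = [
    {"model-A_harmful": 0, "model-B_harmful": 1, "model-C_harmful": 1},
    {"model-A_harmful": 0, "model-B_harmful": 0, "model-C_harmful": 1},
    {"model-A_harmful": 1, "model-B_harmful": 1, "model-C_harmful": 1},
]
ranking = rank_by_harmlessness(toy_rows, ["model-A", "model-B", "model-C"])
```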

Automatic Evaluation

Evaluator Performance: A fine-tuned Longformer classifier (600M parameters) achieves results comparable to human and GPT-4 evaluation.
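
One simple way to check whether an automatic evaluator is "comparable" to human labels is to measure agreement on the binary harmfulness label. The sketch below is only an illustration: the source does not say which metric was used, `compare_labels` is a hypothetical helper, and the label lists are made up.

```python
def compare_labels(reference, predicted):
    """Compare binary harmfulness labels (1 = harmful, 0 = harmless)."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("label lists must not be empty")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
    }


# Made-up labels for illustration only.
human_labels = [0, 0, 1, 1, 0, 1]
auto_labels = [0, 1, 1, 1, 0, 0]
result = compare_labels(human_labels, auto_labels)
```

Reporting the false-positive and false-negative counts alongside accuracy matters here because harmful responses are typically a small minority, so accuracy alone can look high even when harmful cases are missed.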

Citation

@misc{wang2023donotanswer,
    author = {Wang, Yuxia and Li, Haonan and Han, Xudong and Nakov, Preslav and Baldwin, Timothy},
    title = {Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs},
    year = {2023},
    howpublished = {arXiv preprint arXiv:2308.13387},
}

License

Dataset: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
Source Files: Apache 2.0 License

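Collected summary

Dataset introduction

Construction

In natural language processing, and in safety evaluation of large language models in particular, the LibrAI/do-not-answer dataset was curated and filtered to contain only prompts that a responsible language model should not answer. It combines human annotation with model-based evaluation, including a fine-tuned BERT-like evaluator with 600M parameters whose results are comparable to human and GPT-4 evaluation.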

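Characteristics

The dataset focuses on safety evaluation and uses a three-level taxonomy of instructions covering 61 specific harms. It contains 939 instructions distributed across five risk areas and 12 types of harm. In addition, the responses of six models are evaluated at two levels (harmfulness and action category), which gives a broad view of model behavior.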
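
The three-level taxonomy corresponds to the `risk_area`, `types_of_harm`, and `specific_harms` columns, so it can be rebuilt from the rows as a nested mapping. The sketch below is an illustration only: `build_taxonomy` and `taxonomy_sizes` are hypothetical helpers, and the category names in the toy rows are placeholders rather than real labels from the dataset.

```python
from collections import defaultdict


def build_taxonomy(rows):
    """Group rows into risk_area -> types_of_harm -> sorted specific_harms."""
    tree = defaultdict(lambda: defaultdict(set))
    for row in rows:
        tree[row["risk_area"]][row["types_of_harm"]].add(row["specific_harms"])
    return {
        area: {harm: sorted(specifics) for harm, specifics in harms.items()}
        for area, harms in tree.items()
    }


def taxonomy_sizes(taxonomy):
    """Return (number of risk areas, harm types, specific harms)."""
    n_areas = len(taxonomy)
    n_types = sum(len(harms) for harms in taxonomy.values())
    n_specific = sum(
        len(specifics)
        for harms in taxonomy.values()
        for specifics in harms.values()
    )
    return n_areas, n_types, n_specific


# Placeholder category names; not the real labels used in the dataset.
toy_rows = [
    {"risk_area": "area-A", "types_of_harm": "type-1", "specific_harms": "harm-1"},
    {"risk_area": "area-A", "types_of_harm": "type-1", "specific_harms": "harm-2"},
    {"risk_area": "area-A", "types_of_harm": "type-2", "specific_harms": "harm-3"},
    {"risk_area": "area-B", "types_of_harm": "type-3", "specific_harms": "harm-4"},
    {"risk_area": "area-A", "types_of_harm": "type-1", "specific_harms": "harm-1"},
]
taxonomy = build_taxonomy(toy_rows)
sizes = taxonomy_sizes(taxonomy)
```

If the columns match the taxonomy described above, running this over the full train split would be expected to give sizes of 5, 12, and 61, although this sketch has not been run against the real data.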

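Usage

Researchers can use the prompts in the dataset to evaluate how large language models behave with respect to safety. The dataset provides easy-to-interpret evaluation metrics, including harmfulness rankings and action-category distributions, and supports both automatic and human evaluation, so models can be assessed in several ways.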

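Background and challenges

Background overview

Ensuring model safety is a central issue in the research and deployment of large language models (LLMs). The Do-Not-Answer dataset was created in 2023 by Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin to evaluate the safety mechanisms of LLMs. Its core research question is how LLMs respond when they encounter potentially risky prompts. It has influenced related research and offers a low-cost, effective way to evaluate LLM safeguards.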

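Current challenges

The main challenges in building the Do-Not-Answer dataset were: how to select, from large volumes of text, exactly those risky prompts that an LLM should not respond to; how to build an evaluation scheme with a multi-level taxonomy of instructions covering 61 specific harms; and how to combine human and automatic evaluation so that the results are accurate and reliable. In addressing LLM problems such as avoiding harmful content generation, further challenges lie in evaluating and comparing the safety of different models, and in classifying and quantifying both the harmfulness of model responses and the actions the models take.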

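Common scenarios

Classic use case

The LibrAI/do-not-answer dataset collects prompts that a responsible language model should not answer, which makes it a low-cost tool for evaluating LLM safety mechanisms. Its classic use case is fine-grained safety evaluation: researchers use it to check that a model appropriately declines to respond when faced with sensitive or dangerous topics.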

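Practical applications

In practice, the LibrAI/do-not-answer dataset can be used to train and test language models so that they recognize and appropriately handle queries that should not be answered. This is especially important for public-facing language models, because it helps prevent them from generating harmful content in inappropriate situations and so protects users from harm.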

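Derived related work

The dataset has given rise to a line of related research, including comparative analyses of how various language models handle high-risk content and studies of methods for improving model safety mechanisms. This follow-up work has extended the scope of LibrAI/do-not-answer and advanced research on the safety of large language models, contributing to safer and more reliable models.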

The content above was collected and summarized by 遇见数据集.
