PKU-SafeRLHF-30K

Name: PKU-SafeRLHF-30K
Creator: maas
Published: 2025-12-05 16:22:42
License: 暂无描述

魔搭社区2025-12-05 更新2025-02-08 收录

下载链接：

https://modelscope.cn/datasets/PKU-Alignment/PKU-SafeRLHF-30K

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for PKU-SafeRLHF <span style="color: red;">Warning: this dataset contains data that may be offensive or harmful. The data are intended for research purposes, especially research that can make models less harmful. The views expressed in the data do not reflect the views of PKU-Alignment Team or any of its members. </span> ## Dataset Summary The preference dataset consists of 30k+ expert comparison data. Each entry in this dataset includes two responses to a question, along with safety meta-labels and preferences for both responses, taking into consideration their helpfulness and harmlessness. The annotation pipeline for this dataset is depicted in the following image: ![Annotation Pipeline](annotation-pipeline.png) ## Human-Preference on Harmlessness and Helpfulness ### Harmlessness of a QA Pair The harmlessness of a QA pair is evaluated based on ethical considerations and safety implications inherent in the interaction. A QA pair is deemed harmless when it is classified as risk-neutral across all 14 harm categories outlined in our paper. By risk-neutral, we mean that the QA pair does not engender or promote any harmful consequences or risks as per the definitions in these categories. Thus, a risk-neutral QA pair neither incite harm nor leads to unsafe outcomes, effectively aligning with our safety and ethical guidelines. ### Helpfulness of a Response The helpfulness of a response pertains to how effectively it addresses a given prompt. This measure is independent of the harmlessness of the response, as it focuses solely on the quality, clarity, and relevance of the provided information. Consequently, the helpfulness judgment can be distinctly different from the harmlessness judgment. For instance, consider a situation where a user asks about the procedure to synthesize methamphetamine. In such a case, a detailed, step-by-step response would be considered helpful due to its accuracy and thoroughness. However, due to the harmful implications of manufacturing illicit substances, this QA pair would be classified as extremely harmful. ### Ranking of Responses Once the helpfulness and harmlessness of responses are evaluated, they are ranked accordingly. It is important to note that this is a two-dimensional ranking: responses are ranked separately for helpfulness and harmlessness. This is due to the distinctive and independent nature of these two attributes. The resulting rankings provide a nuanced perspective on the responses, allowing us to balance information quality with safety and ethical considerations. These separate rankings of helpfulness and harmlessness contribute to a more comprehensive understanding of LLM outputs, particularly in the context of safety alignment. We have enforced a logical order to ensure the correctness of the harmlessness ranking: harmless responses (i.e. all 14 harm categories risk-neutral) are always ranked higher than harmful ones (i.e., at least 1 category risky). ## Usage To load our dataset, use the `load_dataset()` function as follows: ```python from datasets import load_dataset dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K") ``` ## Paper You can find more information in our paper - **Dataset Paper:** <https://arxiv.org/abs/2307.04657> ## Contact The original authors host this dataset on GitHub here: https://github.com/PKU-Alignment/beavertails.

# PKU-SafeRLHF 数据集卡片 <span style="color: red;">警告：本数据集包含可能具有冒犯性或危害性的数据。此类数据仅用于研究目的，尤其是旨在提升模型安全性的相关研究。数据中体现的观点不代表北京大学对齐团队（PKU-Alignment Team）及其任何成员的立场。</span> ## 数据集概述该偏好数据集包含3万余条专家对比数据。数据集中的每一条样本均包含针对某一问题的两条回复，以及兼顾回复有用性与无害性的安全元标签和针对两条回复的偏好标注。本数据集的标注流程如下图所示： ![标注流程](annotation-pipeline.png) ## 基于无害性与有用性的人类偏好 ### 问答对的无害性问答对的无害性基于交互过程中蕴含的伦理考量与安全影响进行评估。若某一问答对在本文提出的全部14类伤害类别中均被归类为风险中性，则该问答对被认定为无害。所谓风险中性，即该问答对不会根据上述类别的定义产生或助长任何有害后果或风险。因此，风险中性的问答对既不会引发伤害，也不会导致不安全结果，完全契合本团队的安全与伦理准则。 ### 回复的有用性回复的有用性指其针对给定提示的响应有效性。该指标与回复的无害性相互独立，仅关注所提供信息的质量、清晰度与相关性。因此，对有用性的判断与对无害性的判断可能存在显著差异。例如，假设用户询问合成甲基苯丙胺的步骤，此时一份详细的分步回复会因其准确性与全面性被认定为有用，但由于制造违禁物质具有危害性，该问答对将被归类为极具危害性。 ### 回复排序在完成回复的有用性与无害性评估后，将据此对回复进行排序。需特别说明的是，本数据集采用二维排序机制：分别基于有用性与无害性对回复进行独立排序，这是因为这两个属性具有独特且相互独立的本质。由此得到的排序能够从细微层面展现回复的特性，使我们能够在信息质量与安全伦理考量之间取得平衡。这两组独立的有用性与无害性排序有助于更全面地理解大语言模型（Large Language Model）的输出结果，尤其在安全对齐场景中。我们设定了逻辑严谨的排序规则以确保无害性排序的正确性：无害回复（即全部14类伤害类别均为风险中性的回复）的排序始终高于有害回复（即至少存在1类风险的回复）。 ## 使用方法如需加载本数据集，请使用`load_dataset()`函数，代码示例如下： python from datasets import load_dataset dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K") ## 论文您可通过以下论文获取更多相关信息： - **数据集论文**：<https://arxiv.org/abs/2307.04657> ## 联系方式本数据集的原作者将其托管于GitHub：https://github.com/PKU-Alignment/beavertails.

提供机构：

maas

创建时间：

2025-02-07

搜集汇总

数据集介绍