PKU-SafeRLHF

Name: PKU-SafeRLHF
Creator: maas
Published: 2026-05-16 21:40:25
License: No description provided

ModelScope community. Updated 2026-05-16; indexed 2025-02-08.

Download link:

https://modelscope.cn/datasets/PKU-Alignment/PKU-SafeRLHF

Resource description:

# Dataset Card for PKU-SafeRLHF

<span style="color: red;">Warning: this dataset contains data that may be offensive or harmful. The data are intended for research purposes, especially research that can make models less harmful. The views expressed in the data do not reflect the views of PKU-Alignment Team or any of its members. </span>

[[🏠 Homepage](https://sites.google.com/view/pku-saferlhf)] [[🤗 Single Dimension Preference Dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-single-dimension)] [[🤗 Q-A Dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-QA)] [[🤗 Prompt Dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-prompt)]

## Citation

If PKU-SafeRLHF has contributed to your work, please consider citing our research:

```
@article{ji2024beavertails,
  title={Beavertails: Towards improved safety alignment of llm via a human-preference dataset},
  author={Ji, Jiaming and Liu, Mickel and Dai, Josef and Pan, Xuehai and Zhang, Chi and Bian, Ce and Chen, Boyuan and Sun, Ruiyang and Wang, Yizhou and Yang, Yaodong},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
```

```
@article{ji2024pku,
  title={PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference},
  author={Ji, Jiaming and Hong, Donghai and Zhang, Borong and Chen, Boyuan and Dai, Josef and Zheng, Boren and Qiu, Tianyi and Li, Boxun and Yang, Yaodong},
  journal={arXiv preprint arXiv:2406.15513},
  year={2024}
}
```

If you encounter any issues with our dataset, please contact us through the HuggingFace Discussion.

## Dataset Summary

This dataset is a sibling project of [PKU-SafeRLHF-v0](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF/tree/v0) and [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails).

We provide a high-quality dataset of 83.4K preference entries, annotated across two dimensions: harmlessness and helpfulness. Specifically, each entry includes two responses to a question, accompanied by safety meta-labels and preferences for both responses based on their helpfulness and harmlessness. For a more fine-grained labeling of Q-A pairs in this dataset, see [PKU-SafeRLHF-QA](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-QA).

In this work, we performed SFT on Llama2-7B and Llama3-8B with the Alpaca 52K dataset, resulting in Alpaca2-7B and Alpaca3-8B. This dataset contains responses from Alpaca-7B, Alpaca2-7B, and Alpaca3-8B in the corresponding folders under /data.

The data collection pipeline for this dataset is depicted in the following image:

![Data Collection Pipeline](data-collection-pipeline.png)

## Labels of Each Q-A Pair

### Harm Categories

Each Q-A pair in the dataset is labeled with one or more harm categories. Based on existing research and multiple rounds of communication with the annotation team, we identify 19 harm categories:

- S1: `Endangering National Security`
- S2: `Insulting Behavior`
- S3: `Discriminatory Behavior`
- S4: `Endangering Public Health`
- S5: `Copyright Issues`
- S6: `Violence`
- S7: `Drugs`
- S8: `Privacy Violation`
- S9: `Economic Crime`
- S10: `Mental Manipulation`
- S11: `Human Trafficking`
- S12: `Physical Harm`
- S13: `Sexual Content`
- S14: `Cybercrime`
- S15: `Disrupting Public Order`
- S16: `Environmental Damage`
- S17: `Psychological Harm`
- S18: `White-Collar Crime`
- S19: `Animal Abuse`

### Severity Level

Security issues of large language models are not a simple binary right-or-wrong problem. We explore the classification of the inherent severity of harmful events and define three severity levels of harmful events from the aspects of impact and response measures:

| Severity Level | Impact | Response Measures |
|----------------|--------|-------------------|
| `Minor` | Low-level impact, slight personal discomfort, or minor societal impact in a limited range, not violating the law. | Simple personal measures are enough, and no government or professional agency intervention is needed. |
| `Moderate` | Moderate impact, may cause noticeable damage or even death to individuals, difficult to mitigate in a short time, possibly violating the law, but usually does not lead to long-term severe consequences. | May require government or professional agency intervention. |
| `Severe` | Targets groups rather than individuals. May cause extremely serious consequences, involving large-scale casualties, economic losses, environmental damage, and other malicious outcomes, with broad and far-reaching impact. | Requires national or international resources for response and recovery. |

## Human-Preference on Harmlessness and Helpfulness

### Harmlessness of a Q-A Pair

The harmlessness of a Q-A pair is evaluated based on ethical considerations and safety implications inherent in the interaction. A Q-A pair is deemed harmless when it is classified as risk-neutral across all [19 harm categories](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-QA). By risk-neutral, we mean that the Q-A pair does not engender or promote any harmful consequences or risks as per the definitions in these categories. Thus, a risk-neutral Q-A pair neither incites harm nor leads to unsafe outcomes, effectively aligning with our safety and ethical guidelines.
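
The harmlessness rule above can be written as a small predicate. The following is a minimal sketch, assuming each Q-A pair carries a mapping from harm-category name to a boolean risk flag. The `is_harmless` helper and the mapping layout are illustrative assumptions, not the dataset's documented schema; only the category names come from this card.

```python
from typing import Mapping

# The 19 harm categories listed in this card (S1 through S19).
HARM_CATEGORIES = (
    "Endangering National Security", "Insulting Behavior",
    "Discriminatory Behavior", "Endangering Public Health",
    "Copyright Issues", "Violence", "Drugs", "Privacy Violation",
    "Economic Crime", "Mental Manipulation", "Human Trafficking",
    "Physical Harm", "Sexual Content", "Cybercrime",
    "Disrupting Public Order", "Environmental Damage",
    "Psychological Harm", "White-Collar Crime", "Animal Abuse",
)


def is_harmless(flags: Mapping[str, bool]) -> bool:
    """Return True when the Q-A pair is risk-neutral in every category.

    `flags` maps a category name to True if the pair is risky in that
    category. This layout is a hypothetical one chosen for illustration.
    """
    unknown = set(flags) - set(HARM_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown harm categories: {sorted(unknown)}")
    # Assumption: a category absent from the mapping counts as risk-neutral.
    return not any(flags.get(name, False) for name in HARM_CATEGORIES)
```

Under this sketch a single flagged category is enough to make the pair harmful, which matches the "at least 1 category risky" wording used for the harmlessness ranking.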

### Helpfulness of a Response

The helpfulness of a response pertains to how effectively it addresses a given prompt. This measure is independent of the harmlessness of the response, as it focuses solely on the quality, clarity, and relevance of the provided information. Consequently, the helpfulness judgment can be distinctly different from the harmlessness judgment. For instance, consider a situation where a user asks about the procedure to synthesize methamphetamine. In such a case, a detailed, step-by-step response would be considered helpful due to its accuracy and thoroughness. However, due to the harmful implications of manufacturing illicit substances, this Q-A pair would be classified as extremely harmful.

### Ranking of Responses

Once the helpfulness and harmlessness of responses are evaluated, they are ranked accordingly. It is important to note that this is a two-dimensional ranking: responses are ranked separately for helpfulness and harmlessness. This is due to the distinctive and independent nature of these two attributes. The resulting rankings provide a nuanced perspective on the responses, allowing us to balance information quality with safety and ethical considerations. These separate rankings of helpfulness and harmlessness contribute to a more comprehensive understanding of LLM outputs, particularly in the context of safety alignment. We have enforced a logical order to ensure the correctness of the harmlessness ranking: harmless responses (i.e., all 19 harm categories risk-neutral) are always ranked higher than harmful ones (i.e., at least 1 category risky).

## Usage

To load our dataset, use the `load_dataset()` function as follows:

```python
from datasets import load_dataset

dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF")
```

To load a specified subset of our dataset, pass the subset name through the `name` parameter. For example:

```python
from datasets import load_dataset

dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF", name='alpaca-7b')
```

To load PKU-SafeRLHF-v0 (which is one of the sibling projects to this dataset):

```python
from datasets import load_dataset

dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF", revision="v0")
```
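
The logical order described under "Ranking of Responses" can be checked per entry once the data is loaded. The sketch below assumes each entry exposes two safety flags and the index of the response ranked safer; the function names and argument layout are hypothetical and should be adapted to the actual column names of the loaded dataset.

```python
from typing import Optional


def forced_safer_index(is_safe_0: bool, is_safe_1: bool) -> Optional[int]:
    """Return the index the logical order forces to rank higher.

    Returns None when both responses are harmless or both are harmful,
    because the rule in this card only orders harmless above harmful.
    """
    if is_safe_0 and not is_safe_1:
        return 0
    if is_safe_1 and not is_safe_0:
        return 1
    return None


def ranking_is_consistent(is_safe_0: bool, is_safe_1: bool, safer_id: int) -> bool:
    """Check a harmlessness ranking against the harmless-above-harmful rule.

    `safer_id` is the index (0 or 1) of the response annotated as safer;
    this argument layout is an illustrative assumption.
    """
    if safer_id not in (0, 1):
        raise ValueError("safer_id must be 0 or 1")
    forced = forced_safer_index(is_safe_0, is_safe_1)
    return forced is None or forced == safer_id
```

Because helpfulness is ranked independently, a check like this applies only to the harmlessness ranking; the helpfulness preference may legitimately point at the other response.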


Provider: maas

Created: 2025-02-07
