hh-rlhf-nosafe

Name: hh-rlhf-nosafe
Creator: maas
Published: 2026-01-06 16:50:01
License: No description provided

ModelScope community. Updated 2026-01-06, indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/maywell/hh-rlhf-nosafe


Resource description:

# Dataset Card for HH-RLHF

## Dataset Summary

This repository provides access to two different kinds of data:

1. Human preference data about helpfulness and harmlessness from [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862). These data are meant to train preference (or reward) models for subsequent RLHF training. These data are *not* meant for supervised training of dialogue agents. Training dialogue agents on these data is likely to lead to harmful models and should be avoided.
2. Human-generated and annotated red teaming dialogues from [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://www.anthropic.com/red_teaming.pdf). These data are meant for understanding how crowdworkers red team models and which types of red team attacks are successful or not. The data are *not* meant for fine-tuning or preference modeling (use the data above for preference modeling). These data are entire transcripts of conversations that are derived from the harmlessness preference modeling data described above, where only the chosen response is incorporated into the overall transcript. Furthermore, the transcripts are annotated with human and automated measurements of how harmful the overall dialogues are.

**Disclaimer**: The data (especially the harmlessness preference data and the red team data) contain content that may be offensive or upsetting. Topics include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially upsetting subject matter. Please only engage with the data in accordance with your own personal risk tolerance. The data are intended for research purposes, especially research that can make models *less* harmful. The views expressed in the data do not reflect the views of Anthropic or any of its employees.
As mentioned above, these data are *not* intended for training dialogue agents, as this will likely lead to harmful model behavior.

Each of these datasets is described further below.

## Human preference data about helpfulness and harmlessness (PM Data)

The data are described in the paper: [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862). If you find the data useful, please cite the paper. The data format is very simple: each line of the jsonl files contains a pair of texts, one "chosen" and one "rejected".

For **helpfulness**, the data are grouped into train/test splits in three tranches: from our base models (context-distilled 52B language models), via rejection sampling (mostly with best-of-16 sampling) against an early preference model, and a dataset sampled during our iterated "online" process.

For **harmlessness**, the data are only collected for our base models, but otherwise formatted in the same way.

Details about the data collection process and crowdworker population can be found in the paper, specifically in section 2 and appendix D.

## Red teaming data (not PM Data)

The data are described in the paper: [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://www.anthropic.com/red_teaming.pdf). If you find the data useful, please cite the paper.

At a high level, these data can be examined to understand what successful red team attempts look like.

Details about the data and data collection procedures can be found in the Datasheet in the appendix of the paper.
Each line of the jsonl file contains a dictionary with the following fields:

- `transcript` a text transcript of a conversation between a human adversary (red team member) and an AI assistant
- `min_harmlessness_score_transcript` a real value score of the harmlessness of the AI assistant (lower is more harmful) as obtained from a preference model
- `num_params` number of parameters in the language model powering the AI assistant
- `model_type` type of model powering the AI assistant
- `rating` the red team member's rating of how successful they were at breaking the AI assistant (Likert scale, higher is more successful)
- `task_description` a short text description written by the red team member about how they tried to red team the AI assistant
- `task_description_harmlessness_score` a real value score of the harmlessness of the task description (lower is more harmful) as obtained from a preference model
- `red_team_member_id` an arbitrary identifier of the red team member. One red team member can generate multiple red team attacks
- `is_upworker` a binary indicator that is true if the red team member was from the crowd platform Upwork or false if they were from MTurk
- `tags` a list of up to 6 tags per transcript. Tags are short descriptions of the red team attempts generated by crowdworkers who reviewed red team data post-hoc. Tags were only provided for a random sample of 1000 red team attempts for two of four model types.

## Usage

Each of the above datasets is located in a separate sub-directory.
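
As a rough illustration of how the red teaming fields listed above might be used, the sketch below parses a few JSON Lines records and keeps the attempts that the red team member rated as successful and that the preference model scored as harmful. Everything concrete here is an assumption made for illustration: the records are invented placeholders that only mimic the documented field names, the helper names `make_record`, `parse_jsonl`, and `flagged_attempts` are hypothetical, the value types (for example `None` for missing tags) are guesses, and the two thresholds are arbitrary.

```python
import json


def make_record(**overrides):
    """Build one hypothetical record carrying the ten documented field names.

    All values are invented placeholders, not data from the dataset.
    """
    record = {
        "transcript": "<transcript text>",
        "min_harmlessness_score_transcript": 0.0,
        "num_params": "<parameter count>",
        "model_type": "<model type>",
        "rating": 0.0,
        "task_description": "<task description>",
        "task_description_harmlessness_score": 0.0,
        "red_team_member_id": 0,
        "is_upworker": False,
        "tags": None,  # assumed representation of "no tags provided"
    }
    record.update(overrides)
    return record


# Three hypothetical records serialized as JSON Lines (one object per line).
SAMPLE_JSONL = "\n".join(
    json.dumps(record)
    for record in [
        make_record(red_team_member_id=7, rating=4.0,
                    min_harmlessness_score_transcript=-2.5),
        make_record(red_team_member_id=7, rating=0.0,
                    min_harmlessness_score_transcript=1.3),
        make_record(red_team_member_id=12, rating=4.0,
                    min_harmlessness_score_transcript=0.8),
    ]
)


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dictionaries, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def flagged_attempts(records, min_rating=3.0, max_score=0.0):
    """Keep attempts with rating >= min_rating and harmlessness score <= max_score.

    Both thresholds are arbitrary placeholders, not values from the paper.
    """
    return [
        r for r in records
        if r["rating"] >= min_rating
        and r["min_harmlessness_score_transcript"] <= max_score
    ]


records = parse_jsonl(SAMPLE_JSONL)
flagged = flagged_attempts(records)
print(len(records), "records,", len(flagged), "flagged")
```

Combining the human `rating` with the automated `min_harmlessness_score_transcript` is only one possible reading of "successful"; either signal could also be used on its own.
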
To load an individual subset, use the `data_dir` argument of the `load_dataset()` function as follows:

```python
from datasets import load_dataset

# Load all helpfulness/harmless subsets (share the same schema)
dataset = load_dataset("Anthropic/hh-rlhf")

# Load one of the harmless subsets
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")

# Load the red teaming subset
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts")
```

## Contact

The original authors host this dataset on GitHub here: https://github.com/anthropics/hh-rlhf

You can submit inquiries to: redteam@anthropic.com
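
Returning to the preference data format described earlier (one "chosen" and one "rejected" text per jsonl line), the following minimal sketch shows one way such a line might be split into a shared prompt and the two competing final responses. It assumes that each text is a dialogue whose turns are introduced by `\n\nHuman:` and `\n\nAssistant:` markers, which the card itself does not specify. The sample line is invented, and `split_last_turn` and `parse_pair` are hypothetical helper names.

```python
import json

# Assumed turn marker; the card does not document the text layout.
ASSISTANT_MARKER = "\n\nAssistant:"

# Invented example line, not taken from the dataset.
SAMPLE_LINE = json.dumps({
    "chosen": "\n\nHuman: How do I boil an egg?"
              "\n\nAssistant: Simmer it for about ten minutes.",
    "rejected": "\n\nHuman: How do I boil an egg?"
                "\n\nAssistant: I'm not sure.",
})


def split_last_turn(text):
    """Split a dialogue into (prompt up to the last assistant marker, final reply)."""
    idx = text.rfind(ASSISTANT_MARKER)
    if idx == -1:
        raise ValueError("no assistant turn found")
    cut = idx + len(ASSISTANT_MARKER)
    return text[:cut], text[cut:].strip()


def parse_pair(line):
    """Parse one jsonl line into the prompt and its chosen/rejected replies."""
    record = json.loads(line)
    prompt_chosen, chosen = split_last_turn(record["chosen"])
    prompt_rejected, rejected = split_last_turn(record["rejected"])
    return {
        "prompt": prompt_chosen,
        "chosen": chosen,
        "rejected": rejected,
        # True when both texts share the same context before the final reply.
        "same_prompt": prompt_chosen == prompt_rejected,
    }


pair = parse_pair(SAMPLE_LINE)
print(pair["chosen"])
print(pair["rejected"])
```

Keeping the `same_prompt` flag rather than asserting equality leaves it to the caller to decide how to handle any pairs whose contexts differ.
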


Provider:

maas

Created:

2025-10-13

Collection summary

Dataset introduction