hh-rlhf

Name: hh-rlhf
Creator: maas
Published: 2026-05-23 07:01:28
License: No description available

Source: ModelScope community. Updated: 2026-05-23. Indexed: 2024-05-15.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/hh-rlhf

Resource description:

# Dataset Card for HH-RLHF

## Dataset Summary

This repository provides access to two different kinds of data:

1. Human preference data about helpfulness and harmlessness from [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862). These data are meant to train preference (or reward) models for subsequent RLHF training. These data are *not* meant for supervised training of dialogue agents. Training dialogue agents on these data is likely to lead to harmful models and this should be avoided.
2. Human-generated and annotated red teaming dialogues from [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://www.anthropic.com/red_teaming.pdf). These data are meant to understand how crowdworkers red team models and what types of red team attacks are successful or not. The data are *not* meant for fine-tuning or preference modeling (use the data above for preference modeling). These data are entire transcripts of conversations that are derived from the harmlessness preference modeling data described above, where only the chosen response is incorporated into the overall transcript. Furthermore, the transcripts are annotated with human and automated measurements of how harmful the overall dialogues are.

**Disclaimer**: The data (especially the harmlessness preference data and the red team data) contain content that may be offensive or upsetting. Topics include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially upsetting subject matter. Please only engage with the data in accordance with your own personal risk tolerance. The data are intended for research purposes, especially research that can make models *less* harmful. The views expressed in the data do not reflect the views of Anthropic or any of its employees.
As mentioned above, these data are *not* intended for training dialogue agents as this will likely lead to harmful model behavior.

Each of these datasets is described further below.

## Example Code

```python
from modelscope import MsDataset

# Load the train split of the default subset (harmless-base).
ds = MsDataset.load("AI-ModelScope/hh-rlhf", split='train')

# Other subsets are selected with subset_name:
# ds = MsDataset.load("AI-ModelScope/hh-rlhf", subset_name='helpful-base', split='train')
# ds = MsDataset.load("AI-ModelScope/hh-rlhf", subset_name='helpful-online', split='train')
# ds = MsDataset.load("AI-ModelScope/hh-rlhf", subset_name='helpful-rejection-sampled', split='train')
# ds = MsDataset.load("AI-ModelScope/hh-rlhf", subset_name='red-team-attempts', split='train')

# Print the first record of the loaded subset.
print(next(iter(ds)))
```

## Human preference data about helpfulness and harmlessness (PM Data)

The data are described in the paper: [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862). If you find the data useful, please cite the paper.

The data format is very simple: each line of the jsonl files contains a pair of texts, one "chosen" and one "rejected".

For **helpfulness**, the data are grouped into train/test splits in three tranches: from our base models (context-distilled 52B language models), via rejection sampling (mostly with best-of-16 sampling) against an early preference model, and a dataset sampled during our iterated "online" process.

For **harmlessness**, the data are only collected for our base models, but otherwise formatted in the same way.

Details about the data collection process and crowdworker population can be found in the paper, specifically in section 2 and appendix D.

## Red teaming data (not PM Data)

The data are described in the paper: [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://www.anthropic.com/red_teaming.pdf).
If you find the data useful, please cite the paper.

At a high level these data can be examined to understand what successful red team attempts look like.

Details about the data and data collection procedures can be found in the Datasheet in the appendix of the paper.

Each line of the jsonl file contains a dictionary with the following fields:

- `transcript` a text transcript of a conversation between a human adversary (red team member) and an AI assistant
- `min_harmlessness_score_transcript` a real value score of the harmlessness of the AI assistant (lower is more harmful) as obtained from a preference model
- `num_params` number of parameters in the language model powering the AI assistant
- `model_type` type of model powering the AI assistant
- `rating` the red team member's rating of how successful they were at breaking the AI assistant (Likert scale, higher is more successful)
- `task_description` a short text description written by the red team member about how they tried to red team the AI assistant
- `task_description_harmlessness_score` a real value score of the harmlessness of the task description (lower is more harmful) as obtained from a preference model
- `red_team_member_id` an arbitrary identifier of the red team member. One red team member can generate multiple red team attacks
- `is_upworker` a binary indicator that is true if the red team member was from the crowd platform Upwork or false if they were from MTurk
- `tags` a list of up to 6 tags per transcript. Tags are short descriptions of the red team attempts generated by crowdworkers who reviewed red team data post-hoc. Tags were only provided for a random sample of 1000 red team attempts for two of four model types.

## Contact

The original authors host this dataset on GitHub here: https://github.com/anthropics/hh-rlhf

You can submit inquiries to: redteam@anthropic.com
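
## Reading the two record formats (illustrative sketch)

The card describes two jsonl layouts: preference pairs with `chosen` and `rejected` texts, and red-team dictionaries with fields such as `rating` and `min_harmlessness_score_transcript`. The sketch below shows one way to read each of them using only the standard library. It rests on assumptions that the card does not state: the turn marker `"\n\nAssistant:"` is assumed to delimit the final reply, `rating` is assumed to be numeric, and the helper names `split_last_reply`, `split_pair`, and `successful_attacks` are hypothetical. Treat it as a starting point for inspection and preference-model preprocessing, not as the authors' pipeline.

```python
import json
from collections.abc import Iterable

# Assumed turn marker; the dataset card does not specify the transcript layout.
ASSISTANT_TAG = "\n\nAssistant:"


def split_last_reply(transcript: str) -> tuple[str, str]:
    """Split a transcript into (context up to the last assistant tag, final reply)."""
    cut = transcript.rfind(ASSISTANT_TAG)
    if cut == -1:
        raise ValueError("no assistant turn found in transcript")
    cut += len(ASSISTANT_TAG)
    return transcript[:cut], transcript[cut:].strip()


def split_pair(line: str) -> tuple[str, str, str]:
    """Parse one preference jsonl line into (context, chosen reply, rejected reply)."""
    record = json.loads(line)
    context, chosen_reply = split_last_reply(record["chosen"])
    _, rejected_reply = split_last_reply(record["rejected"])
    return context, chosen_reply, rejected_reply


def successful_attacks(lines: Iterable[str], min_rating: float) -> list[dict]:
    """Keep red-team records rated at least `min_rating`, most harmful first.

    A lower `min_harmlessness_score_transcript` means a more harmful dialogue,
    so an ascending sort puts the most harmful transcripts at the front.
    """
    records = (json.loads(line) for line in lines if line.strip())
    kept = [r for r in records if r["rating"] >= min_rating]
    return sorted(kept, key=lambda r: r["min_harmlessness_score_transcript"])
```

For example, `split_pair` can be applied to each line of a preference file to obtain the inputs for a pairwise ranking loss, while `successful_attacks(open(path), min_rating=3)` would list highly rated attacks from a red-team file, where `path` is whatever local file you downloaded.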


Provider: maas

Created: 2024-11-10
