maywell/hh-rlhf-nosafe

Name: maywell/hh-rlhf-nosafe
Creator: maywell
Published: 2024-04-03 00:01:00
License: No description available

Hugging Face, updated 2024-04-03, indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/maywell/hh-rlhf-nosafe

Resource description:

---
license: mit
tags:
- human-feedback
---

# Dataset Card for HH-RLHF

## Dataset Summary

This repository provides access to two different kinds of data:

1. Human preference data about helpfulness and harmlessness from [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862). These data are meant to train preference (or reward) models for subsequent RLHF training. These data are *not* meant for supervised training of dialogue agents. Training dialogue agents on these data is likely to lead to harmful models and this should be avoided.
2. Human-generated and annotated red teaming dialogues from [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://www.anthropic.com/red_teaming.pdf). These data are meant to understand how crowdworkers red team models and what types of red team attacks are successful or not. The data are *not* meant for fine-tuning or preference modeling (use the data above for preference modeling). These data are entire transcripts of conversations that are derived from the harmlessness preference modeling data described above, where only the chosen response is incorporated into the overall transcript. Furthermore, the transcripts are annotated with human and automated measurements of how harmful the overall dialogues are.

**Disclaimer**: The data (especially the harmlessness preference data and the red team data) contain content that may be offensive or upsetting. Topics include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially upsetting subject matter. Please only engage with the data in accordance with your own personal risk tolerance. The data are intended for research purposes, especially research that can make models *less* harmful. The views expressed in the data do not reflect the views of Anthropic or any of its employees. As mentioned above, these data are *not* intended for training dialogue agents as this will likely lead to harmful model behavior.

Each of these datasets is described further below.

## Human preference data about helpfulness and harmlessness (PM Data)

The data are described in the paper: [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862). If you find the data useful, please cite the paper. The data format is very simple -- each line of the jsonl files contains a pair of texts, one "chosen" and one "rejected".

For **helpfulness**, the data are grouped into train/test splits in three tranches: from our base models (context-distilled 52B language models), via rejection sampling (mostly with best-of-16 sampling) against an early preference model, and a dataset sampled during our iterated "online" process.

For **harmlessness**, the data are only collected for our base models, but otherwise formatted in the same way.

Details about the data collection process and crowdworker population can be found in the paper, specifically in section 2 and appendix D.

## Red teaming data (not PM Data)

The data are described in the paper: [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://www.anthropic.com/red_teaming.pdf). If you find the data useful, please cite the paper.

At a high level these data can be examined to understand what successful red team attempts look like.

Details about the data and data collection procedures can be found in the Datasheet in the appendix of the paper.

Each line of the jsonl file contains a dictionary with the following fields:

- `transcript` a text transcript of a conversation between a human adversary (red team member) and an AI assistant
- `min_harmlessness_score_transcript` a real value score of the harmlessness of the AI assistant (lower is more harmful) as obtained from a preference model
- `num_params` number of parameters in the language model powering the AI assistant
- `model_type` type of model powering the AI assistant
- `rating` the red team member's rating of how successful they were at breaking the AI assistant (Likert scale, higher is more successful)
- `task_description` a short text description written by the red team member about how they tried to red team the AI assistant
- `task_description_harmlessness_score` a real value score of the harmlessness of the task description (lower is more harmful) as obtained from a preference model
- `red_team_member_id` an arbitrary identifier of the red team member. One red team member can generate multiple red team attacks
- `is_upworker` a binary indicator that is true if the red team member was from the crowd platform Upwork or false if they were from MTurk
- `tags` a list of up to 6 tags per transcript. Tags are short descriptions of the red team attempts generated by crowdworkers who reviewed red team data post-hoc. Tags were only provided for a random sample of 1000 red team attempts for two of four model types.

## Usage

Each of the above datasets is located in a separate sub-directory. To load an individual subset, use the `data_dir` argument of the `load_dataset()` function as follows:

```python
from datasets import load_dataset

# Load all helpfulness/harmless subsets (share the same schema)
dataset = load_dataset("Anthropic/hh-rlhf")

# Load one of the harmless subsets
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")

# Load the red teaming subset
dataset = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts")
```

## Contact

The original authors host this dataset on GitHub here: https://github.com/anthropics/hh-rlhf

You can submit inquiries to: redteam@anthropic.com

Provider:

maywell

Summary of original information

Dataset overview

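This dataset contains two kinds of data:

Human preference data:
- Source: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
- Purpose: training preference (reward) models; not suitable for supervised training of dialogue agents.
- Format: JSONL files; each line contains a pair of texts, one "chosen" and one "rejected".
- Subsets:
  - Helpfulness: train/test splits in three tranches, drawn from the base models, rejection sampling, and the iterated "online" process.
  - Harmlessness: collected only from the base models, in the same format.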
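
The pair format described above can be read with standard-library Python alone. The following is a minimal sketch, assuming each JSONL line is an object with string fields `chosen` and `rejected`. The sample lines are made up for illustration, and the helper names `parse_pairs` and `shared_context` are hypothetical. It also assumes the two texts of a pair begin with the same context and diverge later, which the card does not state explicitly.

```python
import json
import os


def parse_pairs(jsonl_text):
    """Parse JSONL text into a list of (chosen, rejected) tuples."""
    pairs = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((record["chosen"], record["rejected"]))
    return pairs


def shared_context(chosen, rejected):
    """Return the longest common leading substring of the two texts."""
    return os.path.commonprefix([chosen, rejected])


# Made-up sample lines, NOT taken from the dataset.
sample = "\n".join([
    json.dumps({"chosen": "Q: hi\nA: Hello!", "rejected": "Q: hi\nA: Go away."}),
    json.dumps({"chosen": "Q: 2+2?\nA: 4", "rejected": "Q: 2+2?\nA: 5"}),
])

pairs = parse_pairs(sample)
prefixes = [shared_context(c, r) for c, r in pairs]
```

Comparing the common prefix with each full text gives a quick check of where a pair diverges, without relying on any particular turn markers inside the text.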
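Red-team dialogue data:
- Source: Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.
- Purpose: understanding how crowdworkers red team models and which types of red-team attacks succeed or fail.
- Format: JSONL file; each line contains a dictionary with the conversation transcript, harmlessness scores, the model's parameter count, the model type, the red team member's success rating, and other fields.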
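
As an illustration of how the red-team record fields might be used together, the sketch below keeps attempts that the red team member rated as successful and that the preference model scored as harmful. The field names come from the dataset card; the sample records, their values, the thresholds, and the helper names `load_records` and `successful_attacks` are assumptions made for this example. The card does not give the numeric range of `rating`.

```python
import json

# Made-up records using field names from the dataset card; values are illustrative only.
sample_records = [
    {"red_team_member_id": 0, "transcript": "example transcript 0",
     "min_harmlessness_score_transcript": -2.1, "rating": 4},
    {"red_team_member_id": 1, "transcript": "example transcript 1",
     "min_harmlessness_score_transcript": 1.3, "rating": 0},
    {"red_team_member_id": 2, "transcript": "example transcript 2",
     "min_harmlessness_score_transcript": -0.5, "rating": 3},
    {"red_team_member_id": 3, "transcript": "example transcript 3",
     "min_harmlessness_score_transcript": -3.0, "rating": 1},
]
sample_jsonl = "\n".join(json.dumps(r) for r in sample_records)


def load_records(jsonl_text):
    """Parse one JSON dictionary per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def successful_attacks(records, min_rating, max_score):
    """Keep records rated at least min_rating whose harmlessness score is at
    most max_score (lower is more harmful), sorted from most to least harmful."""
    hits = [
        r for r in records
        if r["rating"] >= min_rating
        and r["min_harmlessness_score_transcript"] <= max_score
    ]
    return sorted(hits, key=lambda r: r["min_harmlessness_score_transcript"])


records = load_records(sample_jsonl)
# Thresholds are arbitrary choices for this example.
selected = successful_attacks(records, min_rating=3, max_score=0.0)
selected_ids = [r["red_team_member_id"] for r in selected]
```

Combining the human `rating` with the automated score in this way is only one possible reading of the data; the two measurements need not agree, as the fourth sample record is meant to show.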

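Data usage

Each dataset is located in a separate sub-directory; a specific subset can be loaded through the data_dir argument of the load_dataset() function.

Contact

The dataset is hosted on GitHub: https://github.com/anthropics/hh-rlhf
Inquiries: redteam@anthropic.com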
