hh-rlhf-strength-cleaned

Name: hh-rlhf-strength-cleaned
Creator: maas
Published: 2025-12-05 16:55:17
License: no description provided

ModelScope community. Updated 2025-12-05; indexed 2025-11-22.

Download link:

https://modelscope.cn/datasets/openmoss/hh-rlhf-strength-cleaned

Resource description:

# Dataset Card for hh-rlhf-strength-cleaned

**Other Language Versions: [English](README.md), [中文](README_zh.md).**

## Dataset Description

In the paper "[Secrets of RLHF in Large Language Models Part II: Reward Modeling](https://arxiv.org/abs/2401.06080)", we measured the preference strength of each preference pair in the [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset using a model ensemble, and annotated the validation set with GPT-4. In this repository, we provide:

1. Metadata of preference strength for both the training and validation sets.
2. GPT-4 annotations on the validation set.

We merged the hh-rlhf dataset and re-divided it into a training set (151k) and a validation set (17k) at a ratio of 9:1.

## Field Description

| Field Name | Field Description | Remarks |
| --- | --- | --- |
| chosen | Same as the hh-rlhf dataset. The last line represents the chosen response, and the preceding lines constitute the dialogue history | Type is a list. The dialogue history is the same for the chosen and rejected responses |
| rejected | Same as the hh-rlhf dataset. The last line represents the rejected response, and the preceding lines constitute the dialogue history | Type is a list. The dialogue history is the same for the chosen and rejected responses |
| GPT4 label | GPT-4 annotation for the preference pair; 1 indicates GPT-4 prefers chosen, 0 indicates GPT-4 prefers rejected, and -1 indicates that the label does not exist | Only present in the validation set |
| mean preference difference | Metric measuring preference strength as discussed in the paper; the absolute value indicates the magnitude, and a positive or negative sign indicates a preference for chosen or rejected, respectively | Average of preference strengths across N models |
| std preference difference | Metric measuring uncertainty in preference strength, given as the standard deviation among preference strengths from different models | Standard deviation of preference strengths across N models |
| chosen score list | List of scores given by N models for the chosen option in each preference pair | Type is a list; each element is the score given by a single model |
| rejected score list | List of scores given by N models for the rejected option in each preference pair | Type is a list; each element is the score given by a single model |
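To make the relationship between the score lists and the two preference-difference fields concrete, here is a minimal Python sketch. It rests on several assumptions that the card does not confirm: that record keys match the field names in the table exactly, that the preference strength of a single model is its chosen score minus its rejected score, and that the standard deviation is the population form. The helper names `preference_stats` and `gpt4_agrees` and the example record are hypothetical and do not come from the dataset.

```python
from statistics import mean, pstdev


def preference_stats(record):
    """Recompute mean and std of per-model preference strength.

    Assumption: per-model strength = chosen score - rejected score,
    and the std is the population standard deviation.
    """
    chosen = record["chosen score list"]
    rejected = record["rejected score list"]
    if len(chosen) != len(rejected):
        raise ValueError("score lists must have one entry per model")
    diffs = [c - r for c, r in zip(chosen, rejected)]
    return mean(diffs), pstdev(diffs)


def gpt4_agrees(record):
    """Return True/False when a GPT-4 label exists, None when it is -1 or absent."""
    label = record.get("GPT4 label", -1)
    if label == -1:
        return None
    return label == 1


# Made-up record for illustration only; not taken from the dataset.
example = {
    "chosen score list": [1.5, 0.5, 1.0],
    "rejected score list": [0.5, 0.5, 0.0],
    "GPT4 label": 1,
}

mean_diff, std_diff = preference_stats(example)
agrees = gpt4_agrees(example)
```

Under these assumptions, a negative `mean_diff` would flag a pair where the ensemble favors the rejected response, and a large `std_diff` would flag a pair on which the models disagree.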

Provider: maas

Created: 2025-10-23
