hh-rlhf-strength-cleaned
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-11-22.
Download link:
https://modelscope.cn/datasets/openmoss/hh-rlhf-strength-cleaned
Resource description:
# Dataset Card for hh-rlhf-strength-cleaned
**Other Language Versions: [English](README.md), [中文](README_zh.md).**
## Dataset Description
In the paper "[Secrets of RLHF in Large Language Models Part II: Reward Modeling](https://arxiv.org/abs/2401.06080)", we measured the preference strength of each preference pair in the [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset using a model ensemble, and annotated the validation set with GPT-4. This repository provides:
1. Preference strength metadata for both the training and validation sets.
2. GPT-4 annotations on the validation set.
We merged the original hh-rlhf data and re-split it into a training set (151k) and a validation set (17k) at a ratio of 9:1.
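The merge-and-re-split step can be pictured with a minimal sketch. The card does not describe how the split was actually performed, so the shuffle, the seed, and the helper name `resplit` are all assumptions made for illustration, and the toy records stand in for real preference pairs.

```python
import random


def resplit(records, valid_fraction=0.1, seed=0):
    """Shuffle merged records and split them into (train, valid).

    Hypothetical helper: the seed and shuffle-based procedure are
    assumptions, not the procedure used to build the dataset.
    """
    shuffled = list(records)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Toy stand-in for the merged hh-rlhf pairs.
merged = [{"id": i} for i in range(1000)]
train, valid = resplit(merged)
print(len(train), len(valid))  # 900 100
```

With `valid_fraction=0.1`, the split gives the 9:1 ratio stated above; the published 151k/17k sizes come from the dataset itself, not from this sketch.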
## Field Description
| Field Name | Field Description | Remarks |
| --------------------------- | ------------------------------------------------------------------------------ | ------------------------------------- |
| chosen | Same as the hh-rlhf dataset. The last line represents the chosen response, and the preceding lines constitute the dialogue history | Type is a list. The dialogue history for both chosen and rejected responses is the same |
| rejected | Same as the hh-rlhf dataset. The last line represents the rejected response, and the preceding lines constitute the dialogue history | Type is a list. The dialogue history for both chosen and rejected responses is the same |
| GPT4 label | GPT-4 annotation for the preference pair; 1 indicates GPT-4 prefers chosen, 0 indicates GPT-4 prefers rejected, and -1 indicates that no label exists | Only present in the validation set |
| mean preference difference | Preference strength metric discussed in the paper; the absolute value gives the magnitude, and a positive or negative sign indicates a preference for chosen or rejected, respectively | Average of preference strengths across N models |
| std preference difference | Uncertainty of the preference strength, measured as the standard deviation of the preference strengths given by the different models | Standard deviation of preference strengths across N models |
| chosen score list | List of scores given by N models for the chosen option in each preference pair | Type is a list, each element represents the score given by a single model |
| rejected score list | List of scores given by N models for the rejected option in each preference pair | Type is a list, each element represents the score given by a single model |
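As a rough illustration of how the fields relate, the sketch below recomputes the mean and standard deviation of the preference difference from the two score lists and measures how often GPT-4 agrees with the original label. It rests on several assumptions: the dictionary keys simply reuse the field names from the table, scores are paired per model by list position, and the population standard deviation is used (the card does not say whether the published values use the population or sample form). The helper names and toy records are hypothetical.

```python
from statistics import mean, pstdev


def preference_stats(chosen_scores, rejected_scores):
    """Return (mean, std) of per-model score differences (chosen - rejected).

    Assumes element i of both lists comes from the same model, and uses
    the population standard deviation (an assumption).
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    diffs = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    return mean(diffs), pstdev(diffs)


def gpt4_agreement(records):
    """Fraction of labeled pairs where GPT-4 prefers chosen; skips label -1."""
    labeled = [r for r in records if r["GPT4 label"] != -1]
    if not labeled:
        return None
    return sum(r["GPT4 label"] == 1 for r in labeled) / len(labeled)


# Toy records with N = 2 models; values are made up for illustration.
records = [
    {"chosen score list": [2.0, 3.0], "rejected score list": [1.0, 0.0], "GPT4 label": 1},
    {"chosen score list": [0.0, 1.0], "rejected score list": [1.0, 2.0], "GPT4 label": 0},
    {"chosen score list": [1.0, 1.0], "rejected score list": [0.0, 2.0], "GPT4 label": -1},
]

stats = [
    preference_stats(r["chosen score list"], r["rejected score list"])
    for r in records
]
print(stats[0])                 # (2.0, 1.0)
print(gpt4_agreement(records))  # 0.5
```

In this toy data, the second record has a negative mean difference with zero spread (both models prefer rejected), while the third has a mean of zero with high spread (the models disagree), which is the distinction the mean and std fields are meant to capture.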
Provider:
maas
Created:
2025-10-23



