hh-rlhf-strength-cleaned

ModelScope community | updated 2025-12-05 | indexed 2025-11-22
Download link:
https://modelscope.cn/datasets/openmoss/hh-rlhf-strength-cleaned
# Dataset Card for hh-rlhf-strength-cleaned

**Other Language Versions: [English](README.md), [中文](README_zh.md).**

## Dataset Description

In the paper "[Secrets of RLHF in Large Language Models Part II: Reward Modeling](https://arxiv.org/abs/2401.06080)", we measured the preference strength of each preference pair in the [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset with a model ensemble and annotated the validation set with GPT-4. This repository provides:

1. Metadata of preference strength for both the training and validation sets.
2. GPT-4 annotations on the validation set.

We merged the hh-rlhf dataset and re-split it into a training set (151k) and a validation set (17k) at a ratio of 9:1.

## Field Description

| Field Name | Field Description | Remarks |
| --- | --- | --- |
| chosen | Same as in the hh-rlhf dataset. The last line is the chosen response, and the preceding lines are the dialogue history | Type is a list. The dialogue history is the same for the chosen and rejected responses |
| rejected | Same as in the hh-rlhf dataset. The last line is the rejected response, and the preceding lines are the dialogue history | Type is a list. The dialogue history is the same for the chosen and rejected responses |
| GPT4 label | GPT-4 annotation for the preference pair: 1 means GPT-4 prefers chosen, 0 means GPT-4 prefers rejected, and -1 means the label does not exist | Only present in the validation set |
| mean preference difference | Preference strength metric discussed in the paper. The absolute value gives the magnitude; a positive value indicates a preference for chosen, a negative value a preference for rejected | Average of preference strengths across N models |
| std preference difference | Uncertainty of the preference strength, given as the standard deviation of the preference strengths from the different models | Standard deviation of preference strengths across N models |
| chosen score list | Scores given by the N models to the chosen response of each preference pair | Type is a list; each element is the score given by a single model |
| rejected score list | Scores given by the N models to the rejected response of each preference pair | Type is a list; each element is the score given by a single model |
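As an illustration of how the fields above relate to each other, the following is a minimal sketch, not the authors' implementation. It assumes that the per-model preference strength is the chosen score minus the rejected score (inferred from the sign convention in the table) and that the standard deviation is the population form; neither detail is confirmed by this card. The record shown is hypothetical and is not taken from the dataset.

```python
from statistics import mean, pstdev


def preference_stats(chosen_scores, rejected_scores):
    """Return (mean, std) of the per-model differences chosen - rejected.

    Assumption: preference strength per model is chosen minus rejected,
    and the std is the population standard deviation.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    diffs = [c - r for c, r in zip(chosen_scores, rejected_scores)]
    return mean(diffs), pstdev(diffs)


def agrees_with_gpt4(record):
    """Compare the ensemble's preferred side with the GPT-4 label.

    Returns None when the label is missing (-1 or absent), otherwise a bool.
    A mean difference of exactly 0 is treated as not preferring chosen.
    """
    label = record.get("GPT4 label", -1)
    if label == -1:
        return None
    ensemble_prefers_chosen = record["mean preference difference"] > 0
    return ensemble_prefers_chosen == (label == 1)


# Hypothetical record using the field names from the table above.
record = {
    "chosen score list": [1.0, 2.0, 3.0],
    "rejected score list": [0.5, 2.5, 1.0],
    "GPT4 label": 1,
}

m, s = preference_stats(record["chosen score list"], record["rejected score list"])
record["mean preference difference"] = m
record["std preference difference"] = s

print(f"mean={m:.4f} std={s:.4f} agrees={agrees_with_gpt4(record)}")
```

Recomputing the statistics from the score lists in this way can serve as a sanity check against the stored `mean preference difference` and `std preference difference` values; if the numbers differ, the assumptions stated above are the first thing to revisit.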

Provider:
maas
Created:
2025-10-23