andrewsiah/rewarded-Eurus-RM-7b_s126_e251
Source: Hugging Face. Updated 2024-05-23. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/andrewsiah/rewarded-Eurus-RM-7b_s126_e251
Resource description:
---
dataset_info:
features:
- name: reward_1
dtype: float64
- name: reward_2
dtype: float64
- name: reward_3
dtype: float64
- name: reward_4
dtype: float64
- name: reward_5
dtype: float64
- name: reward_6
dtype: float64
- name: reward_7
dtype: float64
- name: reward_8
dtype: float64
- name: reward_9
dtype: float64
- name: reward_10
dtype: float64
- name: reward_11
dtype: float64
- name: reward_12
dtype: float64
- name: reward_13
dtype: float64
- name: reward_14
dtype: float64
- name: reward_15
dtype: float64
- name: reward_16
dtype: float64
- name: reward_17
dtype: float64
- name: reward_18
dtype: float64
- name: reward_19
dtype: float64
- name: reward_20
dtype: float64
- name: reward_21
dtype: float64
- name: reward_22
dtype: float64
- name: reward_23
dtype: float64
- name: reward_24
dtype: float64
- name: reward_25
dtype: float64
- name: reward_26
dtype: float64
- name: reward_27
dtype: float64
- name: reward_28
dtype: float64
- name: reward_29
dtype: float64
- name: reward_30
dtype: float64
- name: reward_31
dtype: float64
- name: reward_32
dtype: float64
- name: reward_33
dtype: float64
- name: reward_34
dtype: float64
- name: reward_35
dtype: float64
- name: reward_36
dtype: float64
- name: reward_37
dtype: float64
- name: reward_38
dtype: float64
- name: reward_39
dtype: float64
- name: reward_40
dtype: float64
- name: reward_41
dtype: float64
- name: reward_42
dtype: float64
- name: reward_43
dtype: float64
- name: reward_44
dtype: float64
- name: reward_45
dtype: float64
- name: reward_46
dtype: float64
- name: reward_47
dtype: float64
- name: reward_48
dtype: float64
- name: reward_49
dtype: float64
- name: reward_50
dtype: float64
- name: reward_51
dtype: float64
- name: reward_52
dtype: float64
- name: reward_53
dtype: float64
- name: reward_54
dtype: float64
- name: reward_55
dtype: float64
- name: reward_56
dtype: float64
- name: reward_57
dtype: float64
- name: reward_58
dtype: float64
- name: reward_59
dtype: float64
- name: reward_60
dtype: float64
- name: reward_61
dtype: float64
- name: reward_62
dtype: float64
- name: reward_63
dtype: float64
- name: reward_64
dtype: float64
- name: reward_65
dtype: float64
- name: reward_66
dtype: float64
- name: reward_67
dtype: float64
- name: reward_68
dtype: float64
- name: reward_69
dtype: float64
- name: reward_70
dtype: float64
- name: reward_71
dtype: float64
- name: reward_72
dtype: float64
- name: reward_73
dtype: float64
- name: reward_74
dtype: float64
- name: reward_75
dtype: float64
- name: reward_76
dtype: float64
- name: reward_77
dtype: float64
- name: reward_78
dtype: float64
- name: reward_79
dtype: float64
- name: reward_80
dtype: float64
- name: reward_81
dtype: float64
- name: reward_82
dtype: float64
- name: reward_83
dtype: float64
- name: reward_84
dtype: float64
- name: reward_85
dtype: float64
- name: reward_86
dtype: float64
- name: reward_87
dtype: float64
- name: reward_88
dtype: float64
- name: reward_89
dtype: float64
- name: reward_90
dtype: float64
- name: reward_91
dtype: float64
- name: reward_92
dtype: float64
- name: reward_93
dtype: float64
- name: reward_94
dtype: float64
- name: reward_95
dtype: float64
- name: reward_96
dtype: float64
- name: reward_97
dtype: float64
- name: reward_98
dtype: float64
- name: reward_99
dtype: float64
- name: reward_100
dtype: float64
- name: prompt
dtype: string
- name: subset
dtype: string
- name: rewardbench_chosen
dtype: string
- name: rewardbench_chosen_model
dtype: string
- name: rewardbench_rejected
dtype: string
- name: rewardbench_rejected_model
dtype: string
- name: response_1
dtype: string
- name: response_1_model
dtype: string
- name: response_2
dtype: string
- name: response_2_model
dtype: string
- name: response_3
dtype: string
- name: response_3_model
dtype: string
- name: response_4
dtype: string
- name: response_4_model
dtype: string
- name: response_5
dtype: string
- name: response_5_model
dtype: string
- name: response_6
dtype: string
- name: response_6_model
dtype: string
- name: response_7
dtype: string
- name: response_7_model
dtype: string
- name: response_8
dtype: string
- name: response_8_model
dtype: string
- name: response_9
dtype: string
- name: response_9_model
dtype: string
- name: response_10
dtype: string
- name: response_10_model
dtype: string
- name: response_11
dtype: string
- name: response_11_model
dtype: string
- name: response_12
dtype: string
- name: response_12_model
dtype: string
- name: response_13
dtype: string
- name: response_13_model
dtype: string
- name: response_14
dtype: string
- name: response_14_model
dtype: string
- name: response_15
dtype: string
- name: response_15_model
dtype: string
- name: response_16
dtype: string
- name: response_16_model
dtype: string
- name: response_17
dtype: string
- name: response_17_model
dtype: string
- name: response_18
dtype: string
- name: response_18_model
dtype: string
- name: response_19
dtype: string
- name: response_19_model
dtype: string
- name: response_20
dtype: string
- name: response_20_model
dtype: string
- name: response_21
dtype: string
- name: response_21_model
dtype: string
- name: response_22
dtype: string
- name: response_22_model
dtype: string
- name: response_23
dtype: string
- name: response_23_model
dtype: string
- name: response_24
dtype: string
- name: response_24_model
dtype: string
- name: response_25
dtype: string
- name: response_25_model
dtype: string
- name: response_26
dtype: string
- name: response_26_model
dtype: string
- name: response_27
dtype: string
- name: response_27_model
dtype: string
- name: response_28
dtype: string
- name: response_28_model
dtype: string
- name: response_29
dtype: string
- name: response_29_model
dtype: string
- name: response_30
dtype: string
- name: response_30_model
dtype: string
- name: response_31
dtype: string
- name: response_31_model
dtype: string
- name: response_32
dtype: string
- name: response_32_model
dtype: string
- name: response_33
dtype: string
- name: response_33_model
dtype: string
- name: response_34
dtype: string
- name: response_34_model
dtype: string
- name: response_35
dtype: string
- name: response_35_model
dtype: string
- name: response_36
dtype: string
- name: response_36_model
dtype: string
- name: response_37
dtype: string
- name: response_37_model
dtype: string
- name: response_38
dtype: string
- name: response_38_model
dtype: string
- name: response_39
dtype: string
- name: response_39_model
dtype: string
- name: response_40
dtype: string
- name: response_40_model
dtype: string
- name: response_41
dtype: string
- name: response_41_model
dtype: string
- name: response_42
dtype: string
- name: response_42_model
dtype: string
- name: response_43
dtype: string
- name: response_43_model
dtype: string
- name: response_44
dtype: string
- name: response_44_model
dtype: string
- name: response_45
dtype: string
- name: response_45_model
dtype: string
- name: response_46
dtype: string
- name: response_46_model
dtype: string
- name: response_47
dtype: string
- name: response_47_model
dtype: string
- name: response_48
dtype: string
- name: response_48_model
dtype: string
- name: response_49
dtype: string
- name: response_49_model
dtype: string
- name: response_50
dtype: string
- name: response_50_model
dtype: string
- name: response_51
dtype: string
- name: response_51_model
dtype: string
- name: response_52
dtype: string
- name: response_52_model
dtype: string
- name: response_53
dtype: string
- name: response_53_model
dtype: string
- name: response_54
dtype: string
- name: response_54_model
dtype: string
- name: response_55
dtype: string
- name: response_55_model
dtype: string
- name: response_56
dtype: string
- name: response_56_model
dtype: string
- name: response_57
dtype: string
- name: response_57_model
dtype: string
- name: response_58
dtype: string
- name: response_58_model
dtype: string
- name: response_59
dtype: string
- name: response_59_model
dtype: string
- name: response_60
dtype: string
- name: response_60_model
dtype: string
- name: response_61
dtype: string
- name: response_61_model
dtype: string
- name: response_62
dtype: string
- name: response_62_model
dtype: string
- name: response_63
dtype: string
- name: response_63_model
dtype: string
- name: response_64
dtype: string
- name: response_64_model
dtype: string
- name: response_65
dtype: string
- name: response_65_model
dtype: string
- name: response_66
dtype: string
- name: response_66_model
dtype: string
- name: response_67
dtype: string
- name: response_67_model
dtype: string
- name: response_68
dtype: string
- name: response_68_model
dtype: string
- name: response_69
dtype: string
- name: response_69_model
dtype: string
- name: response_70
dtype: string
- name: response_70_model
dtype: string
- name: response_71
dtype: string
- name: response_71_model
dtype: string
- name: response_72
dtype: string
- name: response_72_model
dtype: string
- name: response_73
dtype: string
- name: response_73_model
dtype: string
- name: response_74
dtype: string
- name: response_74_model
dtype: string
- name: response_75
dtype: string
- name: response_75_model
dtype: string
- name: response_76
dtype: string
- name: response_76_model
dtype: string
- name: response_77
dtype: string
- name: response_77_model
dtype: string
- name: response_78
dtype: string
- name: response_78_model
dtype: string
- name: response_79
dtype: string
- name: response_79_model
dtype: string
- name: response_80
dtype: string
- name: response_80_model
dtype: string
- name: response_81
dtype: string
- name: response_81_model
dtype: string
- name: response_82
dtype: string
- name: response_82_model
dtype: string
- name: response_83
dtype: string
- name: response_83_model
dtype: string
- name: response_84
dtype: string
- name: response_84_model
dtype: string
- name: response_85
dtype: string
- name: response_85_model
dtype: string
- name: response_86
dtype: string
- name: response_86_model
dtype: string
- name: response_87
dtype: string
- name: response_87_model
dtype: string
- name: response_88
dtype: string
- name: response_88_model
dtype: string
- name: response_89
dtype: string
- name: response_89_model
dtype: string
- name: response_90
dtype: string
- name: response_90_model
dtype: string
- name: response_91
dtype: string
- name: response_91_model
dtype: string
- name: response_92
dtype: string
- name: response_92_model
dtype: string
- name: response_93
dtype: string
- name: response_93_model
dtype: string
- name: response_94
dtype: string
- name: response_94_model
dtype: string
- name: response_95
dtype: string
- name: response_95_model
dtype: string
- name: response_96
dtype: string
- name: response_96_model
dtype: string
- name: response_97
dtype: string
- name: response_97_model
dtype: string
- name: response_98
dtype: string
- name: response_98_model
dtype: string
- name: response_99
dtype: string
- name: response_99_model
dtype: string
- name: response_100
dtype: string
- name: response_100_model
dtype: string
- name: rformatted_prompt_response_1
dtype: string
- name: rformatted_prompt_response_2
dtype: string
- name: rformatted_prompt_response_3
dtype: string
- name: rformatted_prompt_response_4
dtype: string
- name: rformatted_prompt_response_5
dtype: string
- name: rformatted_prompt_response_6
dtype: string
- name: rformatted_prompt_response_7
dtype: string
- name: rformatted_prompt_response_8
dtype: string
- name: rformatted_prompt_response_9
dtype: string
- name: rformatted_prompt_response_10
dtype: string
- name: rformatted_prompt_response_11
dtype: string
- name: rformatted_prompt_response_12
dtype: string
- name: rformatted_prompt_response_13
dtype: string
- name: rformatted_prompt_response_14
dtype: string
- name: rformatted_prompt_response_15
dtype: string
- name: rformatted_prompt_response_16
dtype: string
- name: rformatted_prompt_response_17
dtype: string
- name: rformatted_prompt_response_18
dtype: string
- name: rformatted_prompt_response_19
dtype: string
- name: rformatted_prompt_response_20
dtype: string
- name: rformatted_prompt_response_21
dtype: string
- name: rformatted_prompt_response_22
dtype: string
- name: rformatted_prompt_response_23
dtype: string
- name: rformatted_prompt_response_24
dtype: string
- name: rformatted_prompt_response_25
dtype: string
- name: rformatted_prompt_response_26
dtype: string
- name: rformatted_prompt_response_27
dtype: string
- name: rformatted_prompt_response_28
dtype: string
- name: rformatted_prompt_response_29
dtype: string
- name: rformatted_prompt_response_30
dtype: string
- name: rformatted_prompt_response_31
dtype: string
- name: rformatted_prompt_response_32
dtype: string
- name: rformatted_prompt_response_33
dtype: string
- name: rformatted_prompt_response_34
dtype: string
- name: rformatted_prompt_response_35
dtype: string
- name: rformatted_prompt_response_36
dtype: string
- name: rformatted_prompt_response_37
dtype: string
- name: rformatted_prompt_response_38
dtype: string
- name: rformatted_prompt_response_39
dtype: string
- name: rformatted_prompt_response_40
dtype: string
- name: rformatted_prompt_response_41
dtype: string
- name: rformatted_prompt_response_42
dtype: string
- name: rformatted_prompt_response_43
dtype: string
- name: rformatted_prompt_response_44
dtype: string
- name: rformatted_prompt_response_45
dtype: string
- name: rformatted_prompt_response_46
dtype: string
- name: rformatted_prompt_response_47
dtype: string
- name: rformatted_prompt_response_48
dtype: string
- name: rformatted_prompt_response_49
dtype: string
- name: rformatted_prompt_response_50
dtype: string
- name: rformatted_prompt_response_51
dtype: string
- name: rformatted_prompt_response_52
dtype: string
- name: rformatted_prompt_response_53
dtype: string
- name: rformatted_prompt_response_54
dtype: string
- name: rformatted_prompt_response_55
dtype: string
- name: rformatted_prompt_response_56
dtype: string
- name: rformatted_prompt_response_57
dtype: string
- name: rformatted_prompt_response_58
dtype: string
- name: rformatted_prompt_response_59
dtype: string
- name: rformatted_prompt_response_60
dtype: string
- name: rformatted_prompt_response_61
dtype: string
- name: rformatted_prompt_response_62
dtype: string
- name: rformatted_prompt_response_63
dtype: string
- name: rformatted_prompt_response_64
dtype: string
- name: rformatted_prompt_response_65
dtype: string
- name: rformatted_prompt_response_66
dtype: string
- name: rformatted_prompt_response_67
dtype: string
- name: rformatted_prompt_response_68
dtype: string
- name: rformatted_prompt_response_69
dtype: string
- name: rformatted_prompt_response_70
dtype: string
- name: rformatted_prompt_response_71
dtype: string
- name: rformatted_prompt_response_72
dtype: string
- name: rformatted_prompt_response_73
dtype: string
- name: rformatted_prompt_response_74
dtype: string
- name: rformatted_prompt_response_75
dtype: string
- name: rformatted_prompt_response_76
dtype: string
- name: rformatted_prompt_response_77
dtype: string
- name: rformatted_prompt_response_78
dtype: string
- name: rformatted_prompt_response_79
dtype: string
- name: rformatted_prompt_response_80
dtype: string
- name: rformatted_prompt_response_81
dtype: string
- name: rformatted_prompt_response_82
dtype: string
- name: rformatted_prompt_response_83
dtype: string
- name: rformatted_prompt_response_84
dtype: string
- name: rformatted_prompt_response_85
dtype: string
- name: rformatted_prompt_response_86
dtype: string
- name: rformatted_prompt_response_87
dtype: string
- name: rformatted_prompt_response_88
dtype: string
- name: rformatted_prompt_response_89
dtype: string
- name: rformatted_prompt_response_90
dtype: string
- name: rformatted_prompt_response_91
dtype: string
- name: rformatted_prompt_response_92
dtype: string
- name: rformatted_prompt_response_93
dtype: string
- name: rformatted_prompt_response_94
dtype: string
- name: rformatted_prompt_response_95
dtype: string
- name: rformatted_prompt_response_96
dtype: string
- name: rformatted_prompt_response_97
dtype: string
- name: rformatted_prompt_response_98
dtype: string
- name: rformatted_prompt_response_99
dtype: string
- name: rformatted_prompt_response_100
dtype: string
splits:
- name: train
num_bytes: 43484992
num_examples: 125
download_size: 23102703
dataset_size: 43484992
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
Provided by:
andrewsiah
Summary of original information
Dataset overview
Dataset features
The dataset contains the following features:
- Reward features:
  - 100 reward features named reward_1 through reward_100, each of dtype float64.
- Other features:
  - prompt: string
  - subset: string
  - rewardbench_chosen: string
  - rewardbench_chosen_model: string
  - rewardbench_rejected: string
  - rewardbench_rejected_model: string
  - response_1 through response_100, with matching model identifiers response_1_model through response_100_model: all string
  - rformatted_prompt_response_1 through rformatted_prompt_response_100: all string
Dataset splits
- Training split:
  - Name: train
  - Size:
    - Bytes: 43484992
    - Examples: 125
  - Download size: 23102703 bytes
  - Dataset size: 43484992 bytes
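The column layout listed above can be rebuilt programmatically, which may help when sanity-checking a loaded copy against the schema. This is a minimal sketch: the helper name `expected_columns` is illustrative and is not part of the dataset or of any library.

```python
from typing import List

N_RESPONSES = 100  # number of response slots per row, per the feature list

# Standalone string columns from the feature list.
BASE_COLUMNS = [
    "prompt",
    "subset",
    "rewardbench_chosen",
    "rewardbench_chosen_model",
    "rewardbench_rejected",
    "rewardbench_rejected_model",
]


def expected_columns(n: int = N_RESPONSES) -> List[str]:
    """Return the column names in the order the dataset card lists them."""
    cols = [f"reward_{i}" for i in range(1, n + 1)]
    cols += BASE_COLUMNS
    for i in range(1, n + 1):
        cols += [f"response_{i}", f"response_{i}_model"]
    cols += [f"rformatted_prompt_response_{i}" for i in range(1, n + 1)]
    return cols


columns = expected_columns()
# 100 rewards + 6 base + 200 response/model + 100 formatted = 406
print(len(columns))  # 406
```

Comparing `set(columns)` with the column names of a loaded split is one way to detect a schema mismatch early.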
Collected summary
Dataset introduction

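Construction
This dataset was built with a systematic pipeline, against a backdrop of rapid progress in reinforcement learning and alignment. It draws on the RewardBench benchmark: prompts covering a range of scenarios were collected, and for each prompt up to 100 candidate responses were generated by different models. A dedicated reward model (Eurus-RM-7b) then scored these responses automatically, assigning each one a numeric reward. The process keeps data generation scalable and consistent, and the result is a structured dataset of prompts, multi-model responses, and their reward scores.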
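Characteristics
The dataset's most distinctive traits are its high-dimensional reward annotations and the diversity of its responses. Each record holds the original prompt plus 100 text responses from different models, each paired with a floating-point (float64) reward, forming a dense score matrix. This structure provides fine-grained data for studying the discriminative granularity of reward models, comparing the quality distributions of different models' generations, and examining the mechanics of preference alignment. The dataset also keeps the model identifier for each response, which supports tracing performance back to the generating model.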
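Because every response slot carries both a reward and a model identifier, per-model comparisons reduce to a simple group-by. The sketch below is one possible approach; the function name `mean_reward_by_model` and the toy rows are made up for illustration and only mimic the column naming of the real data.

```python
from collections import defaultdict
from statistics import mean


def mean_reward_by_model(rows, n=100):
    """Average reward per generating model across all rows and response slots."""
    scores = defaultdict(list)
    for row in rows:
        for i in range(1, n + 1):
            scores[row[f"response_{i}_model"]].append(row[f"reward_{i}"])
    return {model: mean(vals) for model, vals in scores.items()}


# Tiny synthetic rows (2 response slots each) standing in for real data.
toy_rows = [
    {"response_1_model": "model-a", "reward_1": 1.0,
     "response_2_model": "model-b", "reward_2": -1.0},
    {"response_1_model": "model-a", "reward_1": 3.0,
     "response_2_model": "model-b", "reward_2": 0.0},
]
print(mean_reward_by_model(toy_rows, n=2))  # {'model-a': 2.0, 'model-b': -0.5}
```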
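Usage
The dataset mainly serves academic research on reward model training, reinforcement learning policy optimization, and large language model alignment. It can be loaded directly with the Hugging Face libraries and parsed into triples of prompt, response list, and reward list. Researchers can use the data to train or fine-tune new reward models, fitting the existing scoring pattern with supervised learning. It can also serve as an experience pool for offline reinforcement learning, training a policy model to generate high-reward responses. In addition, analyzing how reward scores are distributed across responses and models allows closer evaluation and comparison of the generation preferences and quality of different language models.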
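The loading and parsing step described above might look like the following. This is a sketch under assumptions: it presumes the `datasets` package is installed and that the repository id taken from the download link resolves on the Hugging Face Hub; `load_train_split` and `row_to_triple` are illustrative helper names, not an official API.

```python
from typing import Any, Dict, List, Tuple

# Repository id as it appears in the download link (assumed to match the Hub id).
REPO_ID = "andrewsiah/rewarded-Eurus-RM-7b_s126_e251"


def load_train_split():
    """Download the train split; needs the `datasets` package and network access."""
    from datasets import load_dataset  # lazy import so the parser below works offline
    return load_dataset(REPO_ID, split="train")


def row_to_triple(row: Dict[str, Any], n: int = 100) -> Tuple[str, List[str], List[float]]:
    """Turn one wide row into (prompt, responses, rewards), aligned by slot index."""
    responses = [row[f"response_{i}"] for i in range(1, n + 1)]
    rewards = [float(row[f"reward_{i}"]) for i in range(1, n + 1)]
    return row["prompt"], responses, rewards
```

With network access, `prompt, responses, rewards = row_to_triple(load_train_split()[0])` would yield one triple; `responses[k]` and `rewards[k]` refer to the same response slot.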
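Background and challenges
Background
At the research frontier of reinforcement learning and human feedback alignment, reward models are a key component for optimizing large language model behavior, and evaluating them depends on high-quality datasets. The dataset andrewsiah/rewarded-Eurus-RM-7b_s126_e251 was built recently against this background by the researcher Andrew Siah, and aims to support reward model training and benchmarking through systematic reward score annotation. Its core research question is how to quantify the quality of different models' responses to diverse prompts, pushing alignment techniques toward greater precision and interpretability. Its contribution is a reproducible evaluation framework for the community, which supports the development of safe and reliable AI systems.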
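Current challenges
The dataset targets a core challenge in reward model evaluation: establishing a unified, objective metric for comparing the outputs of different models. During construction, the first challenge is designing prompt and response pairs that cover a wide range of scenarios, so that the evaluation is comprehensive and representative. Second, assigning an accurate reward score to every response requires overcoming subjective bias, which demands a highly consistent and verifiable annotation process. Finally, integrating the outputs of many models while keeping the data structure intact places heavy demands on the data processing pipeline; an oversight at any stage can undermine the reliability of the final evaluation.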
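The structural integrity concern can be partly addressed with a per-row check before any analysis. The following is a minimal sketch, not the pipeline actually used to build the dataset; `find_row_problems` is a hypothetical helper, and the rules it applies (finite float rewards, string responses and model ids) are assumptions drawn from the declared dtypes.

```python
import math


def find_row_problems(row, n=100):
    """Return human-readable problems for one row; an empty list means it looks fine."""
    problems = []
    for i in range(1, n + 1):
        reward = row.get(f"reward_{i}")
        # Rewards are declared float64, so expect a finite Python float.
        if not isinstance(reward, float) or not math.isfinite(reward):
            problems.append(f"reward_{i}: missing or not a finite float")
        for key in (f"response_{i}", f"response_{i}_model"):
            if not isinstance(row.get(key), str):
                problems.append(f"{key}: missing or not a string")
    return problems


good = {"reward_1": 0.1, "response_1": "hi", "response_1_model": "model-a"}
bad = {"reward_1": float("nan"), "response_1": "hi"}
print(find_row_problems(good, n=1))  # []
print(find_row_problems(bad, n=1))
```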
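Common scenarios
Classic use case
In academic work on reinforcement learning and human feedback alignment, the dataset's reward model scores and varied model responses make it a classic example for training and evaluating reward models. By combining the differing outputs of multiple large language models for the same prompt with systematic reward score annotation, it builds a multi-dimensional, quantifiable framework for comparative evaluation. This framework lets researchers analyze how models differ along alignment dimensions such as safety, helpfulness, and honesty, and provides a data foundation for optimizing and iterating on reward models.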
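Practical applications
In practice, the dataset can directly support fine-tuning and safety alignment engineering for large language models. Development teams can use its reward scores to train or calibrate their own reward models, which in turn guide policy models toward responses that better match human values. Before deploying AI applications in content moderation, assistant dialogue optimization, or high-risk domains (such as medical or legal advice), the multi-model response comparisons and scores can serve as a reference for safety audits of model behavior and for performance benchmarking.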
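One common way to turn scored responses into training signal is to pair the highest-reward response with the lowest-reward one for each prompt. The sketch below illustrates that idea only; the source does not say this pairing was ever applied to the dataset, and `best_and_worst` is a hypothetical helper name.

```python
def best_and_worst(row, n=100):
    """Build one (chosen, rejected) preference pair from a row's scored responses."""
    slots = range(1, n + 1)
    best = max(slots, key=lambda i: row[f"reward_{i}"])
    worst = min(slots, key=lambda i: row[f"reward_{i}"])
    return {
        "prompt": row["prompt"],
        "chosen": row[f"response_{best}"],
        "rejected": row[f"response_{worst}"],
        # Reward gap between the two; small gaps indicate weak preferences.
        "margin": row[f"reward_{best}"] - row[f"reward_{worst}"],
    }


# Synthetic row with 3 response slots, mimicking the real column names.
toy_row = {
    "prompt": "What is 2 + 2?",
    "response_1": "5", "reward_1": -3.0,
    "response_2": "4", "reward_2": 2.5,
    "response_3": "Maybe 4", "reward_3": 0.5,
}
pair = best_and_worst(toy_row, n=3)
print(pair["chosen"], pair["rejected"], pair["margin"])  # 4 5 5.5
```

Filtering pairs by a minimum `margin` is a design choice worth considering, since near-ties carry little preference information.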
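Derived related work
Classic work derived from this dataset focuses mainly on reward model architecture and alignment algorithm improvements. Researchers use its fine-grained reward signal to develop more efficient preference modeling methods, such as extensions of the Bradley-Terry model or improvements to direct policy optimization algorithms. The dataset has also prompted research on reward model generalization, detection of reward hacking, and trade-offs in multi-objective alignment, advancing techniques for encoding complex human value judgments as learnable reward functions.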
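For readers unfamiliar with the Bradley-Terry model mentioned above: it maps a pair of scalar rewards to a preference probability through the logistic function of their difference. The sketch below shows the standard formulation in a numerically stable form; the function names are illustrative, and nothing here is specific to how Eurus-RM-7b was trained.

```python
import math


def bt_preference_prob(reward_a, reward_b):
    """P(a preferred over b) under Bradley-Terry: sigmoid(reward_a - reward_b)."""
    diff = reward_a - reward_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # For negative diff, use the equivalent form that avoids overflow in exp().
    e = math.exp(diff)
    return e / (1.0 + e)


def bt_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the chosen response winning (stable softplus form)."""
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


print(bt_preference_prob(1.5, 1.5))  # 0.5
```

Equal rewards give a probability of 0.5 and a loss of ln 2; the loss shrinks toward zero as the chosen response's reward exceeds the rejected one's by a wider margin.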
The content above was collected and summarized by 遇见数据集.



