OpenRLHF/preference_dataset_mixture2_and_safe_pku

Name: OpenRLHF/preference_dataset_mixture2_and_safe_pku
Creator: OpenRLHF
Published: 2024-06-14 11:28:03
License: no description available

Source: Hugging Face. Updated: 2024-06-14. Indexed: 2024-07-22.

Download link:

https://hf-mirror.com/datasets/OpenRLHF/preference_dataset_mixture2_and_safe_pku

Resource description:

> Copy from https://huggingface.co/datasets/weqweasdas/preference_dataset_mixture2_and_safe_pku

# Reward Model Overview

This is the data mixture used for the reward model weqweasdas/RM-Mistral-7B, trained with the script https://github.com/WeiXiongUST/RLHF-Reward-Modeling.

Also see a short blog post for the training details (data mixture, parameters...): https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0

## Model Details

If you have any questions about this reward model or about reward modeling in general, feel free to email me at wx13@illinois.edu. I would be happy to chat!

### Dataset preprocessing

The model is trained on a mixture of the following datasets.

- [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [SHP](https://huggingface.co/datasets/stanfordnlp/SHP)
- [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
- [Capybara](argilla/distilabel-capybara-dpo-7k-binarized)
- [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)
- [Orca](argilla/distilabel-intel-orca-dpo-pairs)
- [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K)

Differences between this mixture and the original datasets:

- HH-RLHF: we use only the helpful subset and delete the noisy samples where chosen_response == rejected_response;
- SHP: we use only the samples with score ratio > 2, and for each prompt we take at most 5 comparisons, leading to 109526 samples;
- UltraFeedback: similar to UltraFeedback-Binarized, but we use the fine-grained score instead of the overall one to rank samples. For each prompt, we take all 6 possible pairs of comparisons. Finally, we delete the selected pairs with equal scores, leading to 267416 samples;
- HelpSteer: we use the mean of helpfulness and correctness to rank samples. We take all 6 possible pairs of comparisons. Finally, we delete the selected pairs with equal scores, leading to 21576 samples.
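The pair-construction step described for UltraFeedback and HelpSteer can be sketched as follows. This is a minimal illustration, not the authors' script: "all 6 possible pairs" implies four responses per prompt, and the function names, field names, and example scores below are hypothetical.

```python
from itertools import combinations


def helpsteer_score(helpfulness, correctness):
    """Mean of two attribute scores, as described for HelpSteer ranking."""
    return (helpfulness + correctness) / 2


def build_preference_pairs(prompt, responses, scores):
    """Build (chosen, rejected) pairs from every pair of scored responses.

    Pairs with equal scores are dropped because they carry no preference.
    The output field names are assumptions made for this sketch.
    """
    pairs = []
    scored = list(zip(responses, scores))
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored, 2):
        if score_a == score_b:
            continue  # tie: no preference signal
        if score_a > score_b:
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Four responses give C(4, 2) = 6 candidate pairs; one tie is removed here.
example_pairs = build_preference_pairs(
    "What is 2 + 2?",
    ["4", "Four.", "5", "22"],
    [4.5, 4.5, 2.0, 1.0],
)
```

With four responses, at most six pairs survive per prompt, and every tie removes one of them.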

This dataset is a mixture used for training the reward model weqweasdas/RM-Mistral-7B, consisting of several preprocessed datasets including HH-RLHF, SHP, UltraFeedback, Capybara, HelpSteer, Orca, and PKU-Alignment/PKU-SafeRLHF-30K. Each dataset is carefully selected and processed to enhance data quality and model performance.

Provider:

OpenRLHF

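Summary of original information

Dataset overview

Dataset composition

This dataset is the data mixture used to train the reward model weqweasdas/RM-Mistral-7B. It contains the following sub-datasets: HH-RLHF, SHP, UltraFeedback, Capybara, HelpSteer, Orca, and PKU-Alignment/PKU-SafeRLHF-30K.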

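Data preprocessing

HH-RLHF: only the helpful subset is used, and noisy samples where chosen_response == rejected_response are deleted.
SHP: only samples with a score ratio greater than 2 are used, with at most 5 comparisons per prompt, yielding 109526 samples.
UltraFeedback: samples are ranked by the fine-grained score rather than the overall score; all 6 possible comparison pairs are taken for each prompt, and selected pairs with equal scores are deleted, yielding 267416 samples.
HelpSteer: samples are ranked by the mean of helpfulness and correctness; all 6 possible comparison pairs are taken for each prompt, and selected pairs with equal scores are deleted, yielding 21576 samples.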
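The HH-RLHF and SHP filters above can be sketched in a few lines. This is an illustration under stated assumptions: the field names (`chosen`, `rejected`, `prompt`, `score_ratio`) are chosen for the example, and because the source does not say which 5 comparisons are kept per prompt, the sketch simply keeps the first 5 it encounters.

```python
from collections import defaultdict


def filter_identical_pairs(samples):
    """Drop noisy samples whose chosen and rejected responses are identical."""
    return [s for s in samples if s["chosen"] != s["rejected"]]


def filter_shp(samples, min_ratio=2.0, max_per_prompt=5):
    """Keep samples with score ratio > min_ratio, at most max_per_prompt per prompt.

    Selection order is an assumption: the first qualifying samples win.
    """
    kept = []
    per_prompt = defaultdict(int)
    for sample in samples:
        if sample["score_ratio"] <= min_ratio:
            continue  # preference margin too small
        if per_prompt[sample["prompt"]] >= max_per_prompt:
            continue  # this prompt already has enough comparisons
        per_prompt[sample["prompt"]] += 1
        kept.append(sample)
    return kept
```

For example, seven qualifying comparisons for one prompt are cut to five, and a comparison with a score ratio of exactly 2 is dropped because the threshold is strict.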

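Collected summary

Dataset introduction

Construction

The OpenRLHF/preference_dataset_mixture2_and_safe_pku dataset is built by mixing multiple data sources to provide diverse inputs for reward model training. It combines seven datasets, including HH-RLHF, SHP, and UltraFeedback, and applies specific preprocessing steps, such as selecting sample subsets, deleting noisy samples, and ranking by fine-grained scores, to ensure data quality and suitability.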

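Usage

The dataset can serve as base data for training a reward model. Loading and preprocessing can be done with Hugging Face libraries; for specifics, refer to the associated training script and documentation. Users with questions can ask for help via the email address given in the dataset description.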
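A minimal loading sketch with the Hugging Face `datasets` library might look like the following. The repository ID comes from this page; the `train` split name and the helper names `load_mixture` and `describe` are assumptions, so inspect the dataset after loading rather than relying on them.

```python
REPO_ID = "OpenRLHF/preference_dataset_mixture2_and_safe_pku"


def load_mixture(split="train"):
    """Download the mixture from the Hugging Face Hub (needs network access).

    The default split name is an assumption; adjust it if the repository
    uses a different one.
    """
    from datasets import load_dataset  # imported lazily so the module loads without it

    return load_dataset(REPO_ID, split=split)


def describe(dataset):
    """Summarize row count and column names before relying on any schema."""
    return {"num_rows": len(dataset), "columns": list(dataset.column_names)}


# Typical use (not executed here because it downloads data):
#   mixture = load_mixture()
#   print(describe(mixture))
```

Checking the column names first avoids hard-coding a schema that this page does not document.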

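Background and challenges

Background overview

The OpenRLHF/preference_dataset_mixture2_and_safe_pku dataset is a resource for research on reward models. It is the data used to train the weqweasdas/RM-Mistral-7B reward model. The creation date of the original mixture is not stated; this copy was published on 2024-06-14. The listed lead researcher is WeiXiongUST, and the associated institution is the University of Illinois. The core research problem is improving reward models in order to raise decision quality in reinforcement learning. By mixing multiple datasets, it supports research in this area.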

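Current challenges

The dataset faces two main challenges. First, in addressing its target problem of improving reward models, it must handle heterogeneity and uneven data quality across the source datasets. Second, its construction applies strict filtering and preprocessing to original datasets such as HH-RLHF, SHP, and UltraFeedback, for example removing noisy samples, selecting samples with a high score ratio, and using fine-grained scores. These steps make construction more complex and demand more data processing capability.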

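Common scenarios

Typical use case

In deep reinforcement learning, the OpenRLHF/preference_dataset_mixture2_and_safe_pku dataset is used to train reward models. Its typical use is to learn, from preference data comparing different responses, a reward function that judges response quality. Because it combines preference data from several sources, it gives the model varied training material and helps it quantify response quality more accurately.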
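Learning a reward function from chosen/rejected comparisons is commonly framed with a Bradley-Terry style pairwise loss, which penalizes the model when the rejected response scores higher than the chosen one. The source does not state which loss the referenced training script uses, so the sketch below is an assumption for illustration, written with plain floats instead of a deep learning framework.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    Computed as a numerically stable softplus of the negative margin,
    so large positive or negative margins do not overflow.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(reward_pairs):
    """Average the loss over an iterable of (reward_chosen, reward_rejected)."""
    losses = [pairwise_preference_loss(rc, rr) for rc, rr in reward_pairs]
    return sum(losses) / len(losses)
```

The loss equals log 2 when both rewards are equal, shrinks toward zero as the chosen response is scored increasingly higher, and grows roughly linearly when the ordering is wrong.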

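Academic problems addressed

The dataset addresses data diversity and quality filtering in reward model training. By selecting and mixing multiple datasets, it increases sample diversity, and by removing noisy samples and balancing the number of comparisons per prompt, it improves data quality. Both help reward models better capture human preferences.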

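Practical applications

In practice, the dataset is used to build AI systems that understand human preferences and evaluate generated responses. In areas such as chatbots, recommender systems, and content moderation, such reward models can help systems better adapt to and meet user needs.

Recent research on the dataset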