
Skywork-Reward-Preference-80K-v0.2

ModelScope community · Updated 2026-04-28 · Indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2
Description:
# Skywork Reward Preference 80K

> IMPORTANT:
> This dataset is the decontaminated version of [Skywork-Reward-Preference-80K-v0.1](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1). We removed 4,957 pairs from the [magpie-ultra-v0.1](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1) subset that have a significant n-gram overlap with the evaluation prompts in [RewardBench](https://huggingface.co/datasets/allenai/reward-bench). You can find the set of removed pairs [here](https://huggingface.co/datasets/chrisliu298/Skywork-Reward-Preference-80K-v0.1-Contaminated). For more information, see [this GitHub gist](https://gist.github.com/natolambert/1aed306000c13e0e8c5bc17c1a5dd300).
>
> **If your task involves evaluation on [RewardBench](https://huggingface.co/datasets/allenai/reward-bench), we strongly encourage you to use v0.2 instead of v0.1 of the dataset.**
>
> We will soon release our new version of the reward models!

Skywork Reward Preference 80K is a subset of 80K preference pairs, sourced from publicly available data. This subset is used to train [**Skywork-Reward-Gemma-2-27B-v0.2**](https://huggingface.co/Skywork/Skywork-Reward-Gemma-2-27B-v0.2) and [**Skywork-Reward-Llama-3.1-8B-v0.2**](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2).

## Data Mixture

We carefully curate the [Skywork Reward Data Collection](https://huggingface.co/collections/Skywork/skywork-reward-data-collection-66d7fda6a5098dc77035336d) (1) to include high-quality preference pairs and (2) to target specific capability and knowledge domains. The curated training dataset consists of approximately 80K samples, subsampled from multiple publicly available data sources, including

1. [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
2. [OffsetBias](https://huggingface.co/datasets/NCSOFT/offsetbias)
3. [WildGuard (adversarial)](https://huggingface.co/allenai/wildguard)
4. Magpie DPO series: [Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1), [Pro (Llama-3.1)](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1), [Pro](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), [Air](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1).

**Disclaimer: We made no modifications to the original datasets listed above, other than subsampling the datasets to create the Skywork Reward Data Collection.**

During dataset curation, we adopt several tricks to improve performance and balance the domains without compromising overall performance:

1. We select top samples from math, code, and other categories in the combined Magpie dataset independently, based on the average ArmoRM score provided with the dataset. We subtract 0.1 and 0.05 from the ArmoRM average scores of the Magpie-Air and Magpie-Pro subsets, respectively, to prioritize Magpie-Ultra and Magpie-Pro-Llama-3.1 samples.
2. Instead of including all preference pairs in WildGuard, we first train a reward model (RM) on the three other data sources. We then (1) use this RM to score the chosen and rejected responses for all samples in WildGuard and (2) select only samples where the chosen response's RM score is greater than the rejected response's RM score. We observe that this approach largely preserves the original performance on Chat, Chat Hard, and Reasoning while improving Safety. For both models, we use the 27B model to score the WildGuard samples.

## Technical Report

[Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs](https://arxiv.org/abs/2410.18451)

## Contact

If you have any questions, please feel free to reach us at <yuhao.liuu@kunlun-inc.com> or <liang.zeng@kunlun-inc.com>.
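
As a rough illustration of the two curation steps described under Data Mixture, the sketch below applies the per-subset score offsets before per-category top-k selection, and keeps only preference pairs whose chosen response outscores the rejected one. It is a minimal sketch, not the authors' code: the record layout (`MagpieSample`, the `prompt`/`chosen`/`rejected` keys), the subset labels, the `k_per_category` parameter, and the `score_fn` callable standing in for a trained reward model are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Offsets taken from the dataset card: Magpie-Air scores are lowered by 0.1
# and Magpie-Pro scores by 0.05. Other subsets are left unchanged.
# The subset labels used as keys here are hypothetical.
SCORE_OFFSETS = {"air": 0.10, "pro": 0.05}


@dataclass
class MagpieSample:
    """Hypothetical record layout for one Magpie preference pair."""
    subset: str        # e.g. "ultra", "pro_llama31", "pro", "air"
    category: str      # e.g. "math", "code", "other"
    armorm_avg: float  # average ArmoRM score provided with the dataset


def adjusted_score(sample):
    """ArmoRM average minus the subset-specific offset."""
    return sample.armorm_avg - SCORE_OFFSETS.get(sample.subset, 0.0)


def select_top(samples, k_per_category):
    """Pick the k best samples in each category independently."""
    by_category = {}
    for sample in samples:
        by_category.setdefault(sample.category, []).append(sample)
    return {
        category: sorted(group, key=adjusted_score, reverse=True)[:k_per_category]
        for category, group in by_category.items()
    }


def filter_pairs(pairs, score_fn):
    """Keep pairs whose chosen response scores strictly above the rejected one.

    score_fn(prompt, response) -> float is a stand-in for a reward model.
    """
    return [
        pair for pair in pairs
        if score_fn(pair["prompt"], pair["chosen"])
        > score_fn(pair["prompt"], pair["rejected"])
    ]
```

In this sketch the offsets only change the ranking, not the stored scores, so a Magpie-Air sample must beat a Magpie-Ultra sample by more than 0.1 to be selected ahead of it.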
## Citation

If you find our work helpful, please feel free to cite us using the following BibTeX entry:

```bibtex
@article{liu2024skywork,
  title={Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs},
  author={Liu, Chris Yuhao and Zeng, Liang and Liu, Jiacai and Yan, Rui and He, Jujie and Wang, Chaojie and Yan, Shuicheng and Liu, Yang and Zhou, Yahui},
  journal={arXiv preprint arXiv:2410.18451},
  year={2024}
}
```
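
The decontamination note at the top of this card says that pairs with significant n-gram overlap against the RewardBench evaluation prompts were removed, without giving the exact procedure. The following is a minimal sketch of one way such a filter could look. The whitespace tokenization, the n-gram size `n=7`, the `threshold=0.5` overlap ratio, and the `prompt` key are assumptions chosen for illustration and are not taken from the card or the linked gist.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(prompt, eval_ngrams, n=7, threshold=0.5):
    """True if at least `threshold` of the prompt's n-grams occur in the eval set."""
    grams = ngrams(prompt, n)
    if not grams:
        # Prompts shorter than n tokens have no n-grams and are kept.
        return False
    overlap = len(grams & eval_ngrams) / len(grams)
    return overlap >= threshold


def decontaminate(pairs, eval_prompts, n=7, threshold=0.5):
    """Split pairs into (kept, removed) by n-gram overlap with eval prompts."""
    eval_ngrams = set()
    for eval_prompt in eval_prompts:
        eval_ngrams |= ngrams(eval_prompt, n)
    kept, removed = [], []
    for pair in pairs:
        if is_contaminated(pair["prompt"], eval_ngrams, n, threshold):
            removed.append(pair)
        else:
            kept.append(pair)
    return kept, removed
```

Returning the removed pairs alongside the kept ones mirrors the card's practice of publishing the removed set separately, so the filter can be audited.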

Provider:
maas
Created:
2025-02-19