Skywork-Reward-Preference-80K-v0.1
ModelScope community; updated 2026-01-06; indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1
# Skywork Reward Preference 80K
> IMPORTANT:
> This dataset has been shown to contain contaminated samples from the [magpie-ultra-v0.1](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1) subset. The prompts of those samples have a significant n-gram overlap with the evaluation prompts in [RewardBench](https://huggingface.co/datasets/allenai/reward-bench), based on the script in [this GitHub gist](https://gist.github.com/natolambert/1aed306000c13e0e8c5bc17c1a5dd300). You can find the set of removed pairs [here](https://huggingface.co/datasets/chrisliu298/Skywork-Reward-Preference-80K-v0.1-Contaminated).
>
> **If your task involves evaluation on [RewardBench](https://huggingface.co/datasets/allenai/reward-bench), we strongly encourage you to use [Skywork-Reward-Preference-80K-v0.2](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2) instead of v0.1 of the dataset.**
Skywork Reward Preference 80K is a subset of 80K preference pairs, sourced from publicly available data. This subset is used to train [**Skywork-Reward-Gemma-2-27B**](https://huggingface.co/Skywork/Skywork-Reward-Gemma-2-27B) and [**Skywork-Reward-Llama-3.1-8B**](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B).
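
The n-gram overlap check mentioned in the notice above can be illustrated with a minimal sketch. This is not the script from the linked gist: the whitespace tokenization, the n-gram size, the threshold, and the function names `ngrams`, `overlap_ratio`, and `flag_contaminated` are all assumptions chosen for illustration.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(train_prompt: str, eval_prompt: str, n: int = 7) -> float:
    """Fraction of the training prompt's n-grams that also occur in the eval prompt."""
    train = ngrams(train_prompt, n)
    if not train:
        # Prompt shorter than n tokens: no n-grams to compare.
        return 0.0
    return len(train & ngrams(eval_prompt, n)) / len(train)


def flag_contaminated(
    train_prompts: List[str],
    eval_prompts: List[str],
    n: int = 7,            # assumed n-gram size
    threshold: float = 0.5,  # assumed cutoff for "significant" overlap
) -> List[int]:
    """Return indices of training prompts whose best overlap with any eval prompt
    reaches `threshold`."""
    eval_sets = [ngrams(p, n) for p in eval_prompts]
    flagged = []
    for idx, prompt in enumerate(train_prompts):
        grams = ngrams(prompt, n)
        if not grams:
            continue
        best = max((len(grams & e) / len(grams) for e in eval_sets), default=0.0)
        if best >= threshold:
            flagged.append(idx)
    return flagged
```

In practice the evaluation prompts would come from RewardBench and the training prompts from this dataset; the flagged indices are candidates for removal, which is roughly what distinguishes v0.2 from v0.1.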
## Data Mixture
We carefully curate the [Skywork Reward Data Collection](https://huggingface.co/collections/Skywork/skywork-reward-data-collection-66d7fda6a5098dc77035336d) (1) to include high-quality preference pairs and (2) to target specific capability and knowledge domains. The curated training dataset consists of approximately 80K samples, subsampled from multiple publicly available data sources, including
1. [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
2. [OffsetBias](https://huggingface.co/datasets/NCSOFT/offsetbias)
3. [WildGuard (adversarial)](https://huggingface.co/allenai/wildguard)
4. Magpie DPO series: [Ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1), [Pro (Llama-3.1)](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1), [Pro](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), [Air](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1).
**Disclaimer: We made no modifications to the original datasets listed above, other than subsampling the datasets to create the Skywork Reward Data Collection.**
During dataset curation, we adopt several tricks to improve performance and balance the domains without compromising overall performance:
1. We select top samples from math, code, and other categories in the combined Magpie dataset independently, based on the average ArmoRM score provided with the dataset. We subtract 0.1 and 0.05 from the ArmoRM average scores of the Magpie-Air and Magpie-Pro subsets, respectively, to prioritize Magpie-Ultra and Magpie-Pro-Llama-3.1 samples.
2. Instead of including all preference pairs in WildGuard, we first train a reward model (RM) on the three other data sources. We then (1) use this RM to score the chosen and rejected responses for all samples in WildGuard and (2) select only samples where the chosen response's RM score is greater than the rejected response's RM score. We observe that this approach largely preserves the original performance on Chat, Chat Hard, and Reasoning while improving Safety. For both models, we use the 27B model to score the WildGuard samples.
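
The two tricks above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the field names (`subset`, `category`, `armorm_avg`, `prompt`, `chosen`, `rejected`), the subset labels, the per-category budgets, and the `score_fn` callable standing in for the trained RM are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Penalties from the text: 0.1 for Magpie-Air, 0.05 for Magpie-Pro.
# The subset labels used as keys are hypothetical.
SCORE_PENALTY = {"magpie-air": 0.10, "magpie-pro": 0.05}


def select_top_magpie(samples: List[dict], k_per_category: Dict[str, int]) -> List[dict]:
    """Trick 1: rank each category independently by penalized ArmoRM average score
    and keep the top k samples per category."""
    by_category = defaultdict(list)
    for sample in samples:
        adjusted = sample["armorm_avg"] - SCORE_PENALTY.get(sample["subset"], 0.0)
        by_category[sample["category"]].append((adjusted, sample))

    selected = []
    for category, scored in by_category.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        budget = k_per_category.get(category, 0)  # hypothetical per-category budget
        selected.extend(sample for _, sample in scored[:budget])
    return selected


def filter_wildguard(pairs: List[dict], score_fn: Callable[[str, str], float]) -> List[dict]:
    """Trick 2: keep only pairs where the RM scores the chosen response strictly
    higher than the rejected response. `score_fn(prompt, response)` is a stand-in
    for the reward model trained on the other three sources."""
    return [
        pair
        for pair in pairs
        if score_fn(pair["prompt"], pair["chosen"]) > score_fn(pair["prompt"], pair["rejected"])
    ]
```

Penalizing scores before ranking, rather than capping each subset's share directly, lets a strong Magpie-Air or Magpie-Pro sample still be selected when it beats the alternatives by more than the penalty.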
## Technical Report
[Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs](https://arxiv.org/abs/2410.18451)
## Contact
If you have any questions, please feel free to reach us at <yuhao.liuu@kunlun-inc.com> or <liang.zeng@kunlun-inc.com>.
## Citation
If you find our work helpful, please feel free to cite us using the following BibTeX entry:
```bibtex
@article{liu2024skywork,
title={Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs},
author={Liu, Chris Yuhao and Zeng, Liang and Liu, Jiacai and Yan, Rui and He, Jujie and Wang, Chaojie and Yan, Shuicheng and Liu, Yang and Zhou, Yahui},
journal={arXiv preprint arXiv:2410.18451},
year={2024}
}
```
Provider: maas
Created: 2025-02-19



