ultrafeedback-binarized-preferences-cleaned-kto
收藏魔搭社区2026-05-24 更新2024-06-08 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto
下载链接
链接失效反馈官方服务:
资源简介:
# UltraFeedback - Binarized using the Average of Preference Ratings (Cleaned) KTO
> A KTO signal transformed version of the highly loved [UltraFeedback Binarized Preferences Cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned), the preferred dataset by Argilla to use from now on when fine-tuning on UltraFeedback
This dataset represents a new iteration on top of [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/argilla/ultrafeedback-binarized-preferences),
and is the **recommended and preferred dataset by Argilla to use from now on when fine-tuning on UltraFeedback**.
Read more about Argilla's approach towards UltraFeedback binarization at [`argilla/ultrafeedback-binarized-preferences/README.md`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences/blob/main/README.md).
## Why KTO?
The [KTO paper](https://arxiv.org/abs/2402.01306) states:
- KTO matches or exceeds DPO performance at scales from 1B to 30B parameters.1 That is, taking a preference dataset of n DPO pairs and breaking it up into 2n examples for KTO can yield better generations, despite the model ostensibly learning from a weaker signal.
- KTO can handle extreme data imbalances, matching DPO performance while using up to 90% fewer desirable examples (i.e., examples of good generations). Its success thus cannot be ascribed to the alignment data being sourced from a preference dataset.
- When the pretrained model is sufficiently good, one can skip supervised finetuning and go straight to KTO without a loss in generation quality. In contrast, we find that without doing SFT first, DPO-aligned models are significantly worse at all scales.
## Reproduce KTO Transformation
Orginal [UltraFeedback binarized prefrence cleaned DPO dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned)
<a target="_blank" href="https://colab.research.google.com/drive/10MwyxzcQogwO8e1ZcVu7aGTQvjXWpFuD?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
# UltraFeedback —— 基于偏好评分平均值二值化(清理版)KTO数据集
> 本数据集是广受赞誉的[UltraFeedback 二值化偏好清理版](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned)的KTO信号转换版本,也是Argilla官方推荐今后用于UltraFeedback微调的首选数据集。
本数据集是在[`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/argilla/ultrafeedback-binarized-preferences)基础上的全新迭代版本,同时也是**Argilla官方推荐今后用于UltraFeedback微调的首选数据集**。
如需了解Argilla针对UltraFeedback二值化的具体实现方案,可查阅[`argilla/ultrafeedback-binarized-preferences/README.md`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences/blob/main/README.md)。
## 为何选择KTO?
[KTO论文](https://arxiv.org/abs/2402.01306)指出:
- 在10亿至300亿参数的模型规模下,KTO的性能可媲美甚至超越Direct Preference Optimization(DPO)。也就是说,将原本包含n个DPO样本对的偏好数据集拆分为2n个KTO样本后,即便模型看似学习到的信号强度更弱,却能生成质量更优的内容。
- KTO可应对极端的数据不平衡问题,在仅使用最多90%优质样本(即优质生成内容样本)的情况下,仍能达到与DPO相当的性能。因此其对齐效果并非依赖于偏好数据集的来源特性。
- 当预训练模型性能足够优异时,可跳过监督微调(Supervised Fine-Tuning, SFT)步骤直接进行KTO微调,且生成质量不会出现下降。与之形成对比的是,我们的实验发现,若不先进行SFT,所有规模的DPO对齐模型性能都会出现显著下滑。
## KTO转换复现
原始[UltraFeedback 二值化偏好清理版DPO数据集](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned)
<a target="_blank" href="https://colab.research.google.com/drive/10MwyxzcQogwO8e1ZcVu7aGTQvjXWpFuD?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="在Colab中打开"/>
</a>
提供机构:
maas
创建时间:
2024-06-02



