ultrafeedback-binarized-preferences-cleaned-kto

Name: ultrafeedback-binarized-preferences-cleaned-kto
Creator: maas
Published: 2026-05-24 02:23:35
License: No description available

ModelScope community; updated 2026-05-24; indexed 2024-06-08

Download link:

https://modelscope.cn/datasets/AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto

Resource overview:

# UltraFeedback - Binarized using the Average of Preference Ratings (Cleaned) KTO

> A KTO signal transformed version of the highly loved [UltraFeedback Binarized Preferences Cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned), the preferred dataset by Argilla to use from now on when fine-tuning on UltraFeedback

This dataset represents a new iteration on top of [`argilla/ultrafeedback-binarized-preferences`](https://huggingface.co/argilla/ultrafeedback-binarized-preferences), and is the **recommended and preferred dataset by Argilla to use from now on when fine-tuning on UltraFeedback**.

Read more about Argilla's approach towards UltraFeedback binarization at [`argilla/ultrafeedback-binarized-preferences/README.md`](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences/blob/main/README.md).

## Why KTO?

The [KTO paper](https://arxiv.org/abs/2402.01306) states:

- KTO matches or exceeds DPO performance at scales from 1B to 30B parameters. That is, taking a preference dataset of n DPO pairs and breaking it up into 2n examples for KTO can yield better generations, despite the model ostensibly learning from a weaker signal.
- KTO can handle extreme data imbalances, matching DPO performance while using up to 90% fewer desirable examples (i.e., examples of good generations). Its success thus cannot be ascribed to the alignment data being sourced from a preference dataset.
- When the pretrained model is sufficiently good, one can skip supervised finetuning and go straight to KTO without a loss in generation quality. In contrast, we find that without doing SFT first, DPO-aligned models are significantly worse at all scales.
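The data-imbalance point above can be made concrete with a small sketch. It only shows how one might build an imbalanced set by dropping most desirable examples; it is not the paper's experimental procedure, and it leaves out any loss reweighting a trainer might apply. The function name `subsample_desirable`, the `keep_fraction` parameter, and the `label` field are illustrative assumptions.

```python
import random
from typing import Dict, List


def subsample_desirable(
    examples: List[Dict[str, object]], keep_fraction: float, seed: int = 0
) -> List[Dict[str, object]]:
    """Keep only a fraction of the desirable (label=True) examples.

    Hypothetical helper: `label` is assumed to be a boolean marking a
    desirable completion. Undesirable examples are kept unchanged.
    """
    rng = random.Random(seed)
    desirable = [ex for ex in examples if ex["label"]]
    undesirable = [ex for ex in examples if not ex["label"]]
    # Keep at least one desirable example when any exist.
    n_keep = min(len(desirable), max(1, round(len(desirable) * keep_fraction)))
    return rng.sample(desirable, n_keep) + undesirable


# Toy data: 10 desirable and 10 undesirable examples.
toy = [{"completion": f"good {i}", "label": True} for i in range(10)]
toy += [{"completion": f"bad {i}", "label": False} for i in range(10)]

# Dropping 90% of the desirable examples corresponds to keep_fraction=0.1.
imbalanced = subsample_desirable(toy, keep_fraction=0.1)
n_desirable = sum(1 for ex in imbalanced if ex["label"])
print(len(imbalanced), n_desirable)
```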
## Reproduce KTO Transformation

Original [UltraFeedback binarized preference cleaned DPO dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned)

<a target="_blank" href="https://colab.research.google.com/drive/10MwyxzcQogwO8e1ZcVu7aGTQvjXWpFuD?usp=sharing"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>
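The linked notebook holds the actual transformation. As a rough illustration of the idea only, the sketch below splits n paired preference records into 2n unpaired examples. The field names (`prompt`, `chosen`, `rejected`, `completion`, `label`) and the helper `pairs_to_kto` are assumptions made for this example; the real datasets carry extra columns and may store completions differently.

```python
from typing import Dict, Iterable, List


def pairs_to_kto(pairs: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Split each (prompt, chosen, rejected) pair into two unpaired examples.

    Hypothetical helper: the chosen response becomes a desirable example
    (label=True) and the rejected response an undesirable one (label=False).
    """
    examples: List[Dict[str, object]] = []
    for pair in pairs:
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True}
        )
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False}
        )
    return examples


# Toy DPO-style pairs (invented for illustration, not taken from the dataset).
toy_pairs = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
    {"prompt": "Name a prime number.", "chosen": "7", "rejected": "8"},
]

kto_examples = pairs_to_kto(toy_pairs)
print(len(kto_examples))  # two pairs become four unpaired examples
```

Each output record stands on its own, so a trainer no longer needs the chosen and rejected responses for a prompt to appear together.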

Provider:

maas

Created:

2024-06-02

Collection summary

Dataset introduction