tpo-alignment/triple-preference-ultrafeedback-40K

Name: tpo-alignment/triple-preference-ultrafeedback-40K
Creator: tpo-alignment
Published: 2026-04-14 19:42:53
License: MIT

Source: Hugging Face
Updated: 2026-04-14
Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/tpo-alignment/triple-preference-ultrafeedback-40K

Resource description:

---
license: mit
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- text-generation
tags:
- rlhf
- alignment
pretty_name: tpo-ultrafeedback
---

# Dataset Card for llama3-ultrafeedback-armorm

This dataset was used to train [tpo-alignment/Llama-3-8B-TPO-L-40k](https://huggingface.co/tpo-alignment/Llama-3-8B-TPO-L-40k), [tpo-alignment/Llama-3-8B-TPO-40k](https://huggingface.co/tpo-alignment/Llama-3-8B-TPO-40k), and [tpo-alignment/Mistral-7B-TPO-40k](https://huggingface.co/tpo-alignment/Mistral-7B-TPO-40k).

## Dataset Creation

This dataset is built on [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback). We restructure UltraFeedback to select three preference responses per prompt. First, we rank the responses by the scores provided in the base dataset. The highest-scoring response is selected as the `reference`. The next-highest response that scores at least 2 points below the `reference` is selected as the `chosen`, and a third response that scores at least 2 points below the `chosen` is selected as the `rejected`. We discard prompts that do not meet these criteria, resulting in a final dataset of approximately **40K** samples.

## Citation

UltraFeedback paper:

```
@article{cui2023ultrafeedback,
  title={{UltraFeedback}: Boosting language models with high-quality feedback},
  author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2310.01377},
  year={2023}
}
```

TPO paper:

```
@article{saeidi2025triple,
  title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
  author={Amir Saeidi and Shivanshu Verma and Kashif Rasul and Aswin RRV and Chitta Baral},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=A4jyaZheE8},
}
```
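
The selection rule described under Dataset Creation can be sketched roughly as follows. This is an illustrative reading of the card's description, not the authors' actual preprocessing code. The function name `select_triple`, the `text` and `score` field names, and the tie-breaking behaviour (taking the highest-ranked response that satisfies each gap) are all assumptions made for this example; the real UltraFeedback schema differs.

```python
from typing import Optional

MIN_GAP = 2.0  # "at least a 2-point lower score", as stated in the card


def select_triple(responses: list[dict], min_gap: float = MIN_GAP) -> Optional[dict]:
    """Pick reference/chosen/rejected from one prompt's scored responses.

    Each response is assumed to look like {"text": str, "score": float}.
    These field names are hypothetical, not the UltraFeedback schema.
    Returns None when the prompt should be discarded.
    """
    # Rank responses from highest to lowest score.
    ranked = sorted(responses, key=lambda r: r["score"], reverse=True)
    if not ranked:
        return None

    # The top-scoring response becomes the reference.
    reference = ranked[0]

    # Chosen: the best remaining response at least `min_gap` below the reference.
    chosen = next(
        (r for r in ranked[1:] if reference["score"] - r["score"] >= min_gap),
        None,
    )
    if chosen is None:
        return None

    # Rejected: the best response at least `min_gap` below the chosen one.
    rejected = next(
        (r for r in ranked if chosen["score"] - r["score"] >= min_gap),
        None,
    )
    if rejected is None:
        return None

    return {
        "reference": reference["text"],
        "chosen": chosen["text"],
        "rejected": rejected["text"],
    }


# Hypothetical example scores, not taken from the dataset.
kept = select_triple([
    {"text": "a", "score": 9.0},
    {"text": "b", "score": 8.0},
    {"text": "c", "score": 6.5},
    {"text": "d", "score": 4.0},
])
dropped = select_triple([
    {"text": "a", "score": 9.0},
    {"text": "b", "score": 8.0},
    {"text": "c", "score": 7.5},
    {"text": "d", "score": 7.0},
])
```

Under this reading, a prompt survives only if its scores span at least 4 points from best to worst, which would explain why filtering reduces the base dataset to roughly 40K samples.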
