tpo-alignment/triple-preference-ultrafeedback-40K
Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/tpo-alignment/triple-preference-ultrafeedback-40K
Description:
---
license: mit
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- text-generation
tags:
- rlhf
- alignment
pretty_name: tpo-ultrafeedback
---
# Dataset Card for triple-preference-ultrafeedback-40K
This dataset was used to train [tpo-alignment/Llama-3-8B-TPO-L-40k](https://huggingface.co/tpo-alignment/Llama-3-8B-TPO-L-40k), [tpo-alignment/Llama-3-8B-TPO-40k](https://huggingface.co/tpo-alignment/Llama-3-8B-TPO-40k), and [tpo-alignment/Mistral-7B-TPO-40k](https://huggingface.co/tpo-alignment/Mistral-7B-TPO-40k).
## Dataset Creation
This dataset is built on [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback). We restructure UltraFeedback to select three responses per prompt. First, we rank each prompt's responses by the scores provided in the base dataset. The highest-scoring response is selected as the `reference`; the next-highest response that scores at least 2 points below the `reference` is selected as the `chosen`; and a third response that scores at least 2 points below the `chosen` is selected as the `rejected`. We discard prompts that do not meet these criteria, which leaves a final dataset of approximately **40K** samples.
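The selection rule described above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the authors' released preprocessing code: the function name `select_triple`, the `min_gap` parameter, and the `(text, score)` input format are all assumptions made for the example, and the UltraFeedback score field it would be applied to is not specified here. The sketch takes "at least a 2-point lower score" to mean the best-ranked response whose score is at least `min_gap` below the previous pick.

```python
from typing import Dict, Optional, Sequence, Tuple


def select_triple(
    responses: Sequence[Tuple[str, float]],
    min_gap: float = 2.0,  # assumed name for the 2-point margin in the card
) -> Optional[Dict[str, str]]:
    """Pick reference/chosen/rejected from (text, score) pairs, or return None."""
    # Rank responses from highest to lowest score.
    ranked = sorted(responses, key=lambda r: r[1], reverse=True)
    if len(ranked) < 3:
        return None

    # reference: the highest-scoring response.
    ref_text, ref_score = ranked[0]

    # chosen: best-ranked response scoring at least `min_gap` below the reference.
    chosen = next((r for r in ranked[1:] if ref_score - r[1] >= min_gap), None)
    if chosen is None:
        return None

    # rejected: best-ranked response scoring at least `min_gap` below the chosen.
    rejected = next((r for r in ranked if chosen[1] - r[1] >= min_gap), None)
    if rejected is None:
        return None

    return {"reference": ref_text, "chosen": chosen[0], "rejected": rejected[0]}


# Hypothetical scores for one prompt: B is too close to A, so C becomes `chosen`.
example = [("A", 9.0), ("B", 8.0), ("C", 6.5), ("D", 4.0)]
print(select_triple(example))
```

Prompts for which the function returns `None` correspond to the samples that are discarded.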
## Citation
UltraFeedback paper:
```
@article{cui2023ultrafeedback,
title={{UltraFeedback}: Boosting language models with high-quality feedback},
author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2310.01377},
year={2023}
}
```
TPO paper:
```
@article{saeidi2025triple,
title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
author={Amir Saeidi and Shivanshu Verma and Kashif Rasul and Aswin RRV and Chitta Baral},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=A4jyaZheE8},
}
```
Provided by:
tpo-alignment



