tpo-alignment/triple-preference-ultrafeedback-40K

Source: Hugging Face; updated 2026-04-14; indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/tpo-alignment/triple-preference-ultrafeedback-40K
Resource description:
---
license: mit
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- text-generation
tags:
- rlhf
- alignment
pretty_name: tpo-ultrafeedback
---

# Dataset Card for llama3-ultrafeedback-armorm

This dataset was used to train [tpo-alignment/Llama-3-8B-TPO-L-40k](https://huggingface.co/tpo-alignment/Llama-3-8B-TPO-L-40k), [tpo-alignment/Llama-3-8B-TPO-40k](https://huggingface.co/tpo-alignment/Llama-3-8B-TPO-40k), and [tpo-alignment/Mistral-7B-TPO-40k](https://huggingface.co/tpo-alignment/Mistral-7B-TPO-40k).

## Dataset Creation

This dataset is built from [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback). We reconstruct UltraFeedback to select three responses per prompt. First, we rank the responses by the scores provided in the base dataset. The highest-scoring response is selected as the `reference`. The next-highest response that scores at least 2 points below the `reference` is selected as the `chosen`, and a third response that scores at least 2 points below the `chosen` is selected as the `rejected`. We discard prompts that do not meet these criteria, resulting in a final dataset of approximately **40K** samples.

## Citation

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

UltraFeedback paper:
```
@article{cui2023ultrafeedback,
  title={{UltraFeedback}: Boosting language models with high-quality feedback},
  author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2310.01377},
  year={2023}
}
```

TPO paper:
```
@article{saeidi2025triple,
  title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization},
  author={Amir Saeidi and Shivanshu Verma and Kashif Rasul and Aswin RRV and Chitta Baral},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=A4jyaZheE8},
}
```
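
The selection rule described under Dataset Creation can be sketched in Python as follows. This is a minimal illustration, not the authors' code. The function name `select_triple` and the `text` and `score` keys are hypothetical. The card does not say how ties or multiple eligible candidates are resolved, so this sketch assumes the highest-scoring response that satisfies each 2-point margin is taken.

```python
from typing import Optional

MARGIN = 2.0  # minimum score gap stated in the dataset card


def select_triple(responses: list[dict], margin: float = MARGIN) -> Optional[dict]:
    """Return reference/chosen/rejected for one prompt, or None if it is discarded.

    Each response is a dict with hypothetical keys "text" and "score".
    """
    # Rank responses from highest to lowest score.
    ranked = sorted(responses, key=lambda r: r["score"], reverse=True)
    if len(ranked) < 3:
        return None

    reference = ranked[0]

    # Assumption: take the best response that is at least `margin` below the reference.
    chosen_idx = next(
        (i for i, r in enumerate(ranked[1:], start=1)
         if r["score"] <= reference["score"] - margin),
        None,
    )
    if chosen_idx is None:
        return None
    chosen = ranked[chosen_idx]

    # Assumption: take the best remaining response at least `margin` below the chosen one.
    rejected = next(
        (r for r in ranked[chosen_idx + 1:]
         if r["score"] <= chosen["score"] - margin),
        None,
    )
    if rejected is None:
        return None

    return {
        "reference": reference["text"],
        "chosen": chosen["text"],
        "rejected": rejected["text"],
    }


if __name__ == "__main__":
    example = [
        {"text": "a", "score": 9.0},
        {"text": "b", "score": 8.0},
        {"text": "c", "score": 6.5},
        {"text": "d", "score": 4.0},
    ]
    print(select_triple(example))
```

In the example, response `b` is skipped because it is only 1 point below the top score. A prompt whose responses all score within 2 points of each other yields `None`, which corresponds to the discarded samples mentioned in the card.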

Provided by:
tpo-alignment