tulu-2.5-preference-data

Name: tulu-2.5-preference-data
Creator: maas
Published: 2026-01-06 16:33:52
License: No description available

ModelScope community; updated 2026-01-06; indexed 2025-05-31

Download link:

https://modelscope.cn/datasets/allenai/tulu-2.5-preference-data


Resource description:

<center>
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu-2.5/tulu_25_banner.png" alt="Tulu 2.5 banner image" width="800px"/>
</center>

# Tulu 2.5 Preference Data

This dataset contains the preference dataset splits used to train the models described in [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
We cleaned and converted all datasets to a single common format, so some splits may differ from their original format.
To see the code used for creating most splits, see [here](https://github.com/hamishivi/EasyLM/blob/main/conversion_scripts/convert_preference_data.py).
If you only wish to download one dataset, each dataset exists in one file under the `data/` directory in this repository.

## Dataset Details

- **Language(s) (NLP):** English (mostly - we did not explicitly clean non-English data).
- **License:** ODC-BY. Note that different splits may have additional license details, noted below.

Each split is described below:

- **alpaca_farm_gpt4_pref**: The GPT-4 preference split from the [AlpacaFarm dataset](https://huggingface.co/datasets/tatsu-lab/alpaca_farm). CC-BY-NC-4.0 license.
- **alpaca_farm_human_pref**: The human preference split from the [AlpacaFarm dataset](https://huggingface.co/datasets/tatsu-lab/alpaca_farm). CC-BY-NC-4.0 license.
- **capybara**: The [7k DPO binarized Capybara dataset](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized) from Argilla. Apache 2.0 license.
- **chatbot_arena_2023**: The [Chatbot Arena conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations) from 2023. The user prompts are licensed under CC-BY-4.0, while the model outputs are licensed under CC-BY-NC-4.0.
- **chatbot_arena_2024**: The [Chatbot Arena human preferences dataset](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k) from 2024. Apache 2.0 license.
- **helpsteer**: A binarized form of the [HelpSteer dataset](https://huggingface.co/datasets/nvidia/HelpSteer). We average the aspect scores (except verbosity) to choose the chosen and rejected pairs. CC-BY-4.0 license.
- **hh_rlhf**: The [Anthropic HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf), formatted and cleaned. MIT license.
- **nectar**: The [Nectar dataset](https://huggingface.co/datasets/berkeley-nest/Nectar) used for the Starling models, formatted and cleaned. Apache 2.0 license.
- **orca_dpo_pairs**: The [Intel Orca DPO pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs), more specifically [the Argilla cleaned version](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs). Apache 2.0 license.
- **prm800k_pairs_phase2**: The train split data from the second phase of [PRM800k](https://github.com/openai/prm800k/blob/main/prm800k/data/phase2_train.jsonl) annotations, formatted into preference data. MIT license.
- **shp_2**: The [SHP-2 dataset](https://huggingface.co/datasets/stanfordnlp/SHP-2), randomly downsampled to 500k samples. Please see the [SHP-2 page](https://huggingface.co/datasets/stanfordnlp/SHP-2) for license details - the Reddit data is licensed under a historical variant of the Reddit license, and the StackExchange data is licensed under CC-BY-SA.
- **stack_exchange_paired**: The [StackExchange paired dataset](https://huggingface.co/datasets/lvwerra/stack-exchange-paired), randomly downsampled to 500k samples. CC-BY-SA-4.0 license.
- **ultrafeedback_mean_aspects**: The [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback), specifically the [Argilla cleaned version](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned). MIT license.
- **ultrafeedback_overall**: The [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback), specifically the [Argilla cleaned version](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned). We re-order chosen and rejected to match the overall score given by GPT-4, instead of averaging the per-aspect scores. MIT license.
- **hh_rlhf_60k**: A random subsample of 60,908 samples from the above HH-RLHF data. Used for the PPO experiments in our paper. MIT license.
- **nectar_60k**: A random subsample of 60,908 samples from the above Nectar data. Used for the PPO experiments in our paper. Apache 2.0 license.
- **stack_exchange_60k**: A random subsample of 60,908 samples from the above StackExchange paired data. Used for the PPO experiments in our paper. CC-BY-SA-4.0 license.
- **preference_big_mixture**: A mixture of the HelpSteer, PRM800k, HH-RLHF, Nectar, StackExchange, and UltraFeedback datasets. We randomly downsample StackExchange, HH-RLHF, and Nectar to 60,908 samples each. For the licenses of these datasets, see the corresponding split above.
- **ultrafeedback_lower_10k**: A random subsample of 10k samples from ultrafeedback_mean_aspects that contain only generations from wizardlm-7b, llama-2-7b-chat, starchat, alpaca-7b, pythia-12b, falcon-40b-instruct. MIT license.
- **ultrafeedback_middle_10k**: A random subsample of 10k samples from ultrafeedback_mean_aspects that contain only generations from vicuna-33b, mpt-30b-chat, llama-2-70b-chat, wizardlm-13b, llama-2-13b-chat, ultralm-65b, ultralm-13b. MIT license.
- **ultrafeedback_top_10k**: A random subsample of 10k samples from ultrafeedback_mean_aspects that contain only generations from gpt-4, gpt-3.5, wizardlm-70b, bard. MIT license.
- **ultrafeedback_evol_instruct**: The set of samples from ultrafeedback_mean_aspects using prompts originally from Evol-Instruct. MIT license.
- **ultrafeedback_false_qa**: The set of samples from ultrafeedback_mean_aspects using prompts originally from [FalseQA](https://github.com/thunlp/FalseQA). MIT license.
- **ultrafeedback_flan_v2**: The set of samples from ultrafeedback_mean_aspects using prompts originally from [Flan V2](https://github.com/google-research/FLAN/tree/main/flan/v2). MIT license.
- **ultrafeedback_sharegpt**: The set of samples from ultrafeedback_mean_aspects using prompts originally from ShareGPT. MIT license.
- **ultrafeedback_truthful_qa**: The set of samples from ultrafeedback using prompts originally from [TruthfulQA](https://github.com/sylinrl/TruthfulQA). Note that these prompts are not included in any of the other UltraFeedback splits (including ultrafeedback_mean_aspects and ultrafeedback_overall). MIT license.
- **ultrafeedback_ultrachat**: The set of samples from ultrafeedback using prompts originally from [UltraChat](https://github.com/thunlp/UltraChat). MIT license.

For more details, please see our paper [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279) - specifically the appendices for details on dataset construction.

## Uses

This dataset is intended for use in research when training models with varied RLHF methods.

## Citation

If you find this data useful, please cite:

```bibtex
@misc{ivison2024unpacking,
  title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
  author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}},
  year={2024},
  eprint={2406.09279},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
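
As an illustration of the binarization described for the **helpsteer** split (average the aspect scores except verbosity, then pick chosen and rejected responses), here is a minimal sketch. It is not the authors' conversion script: the output keys `prompt`, `chosen`, and `rejected`, the best-versus-worst pairing, and the tie handling are all assumptions made for this example. The input field names follow the attribute names published with HelpSteer.

```python
from collections import defaultdict
from statistics import mean

# HelpSteer annotates five attributes; verbosity is deliberately left out of the average.
SCORED_ASPECTS = ("helpfulness", "correctness", "coherence", "complexity")


def binarize(rows):
    """Build one (chosen, rejected) pair per prompt from per-response aspect scores."""
    by_prompt = defaultdict(list)
    for row in rows:
        score = mean(row[aspect] for aspect in SCORED_ASPECTS)
        by_prompt[row["prompt"]].append((score, row["response"]))

    pairs = []
    for prompt, scored in by_prompt.items():
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        if best[0] == worst[0]:
            # Assumption: prompts without a clear preference are skipped.
            continue
        pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs


# Toy rows (hypothetical values, not taken from the dataset).
rows = [
    {"prompt": "What is DPO?", "response": "A", "helpfulness": 4, "correctness": 4,
     "coherence": 4, "complexity": 2, "verbosity": 1},
    {"prompt": "What is DPO?", "response": "B", "helpfulness": 1, "correctness": 2,
     "coherence": 3, "complexity": 2, "verbosity": 4},
]
pairs = binarize(rows)
print(pairs)
```

Response A averages 3.5 over the four scored aspects and B averages 2, so A becomes the chosen response even though B has the higher verbosity score.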

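The difference between **ultrafeedback_mean_aspects** and **ultrafeedback_overall** noted above is only in which score orders the pair. The sketch below shows how the two rules can disagree on the same two responses. The record layout (`id`, `overall`, `aspects`) and the helper names are hypothetical and chosen for this example; the actual files may store scores differently.

```python
from statistics import mean


def order_pair(resp_a, resp_b, score_fn):
    """Return (chosen, rejected) under score_fn, or None when the scores tie."""
    score_a, score_b = score_fn(resp_a), score_fn(resp_b)
    if score_a == score_b:
        return None
    return (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)


def mean_aspects(resp):
    """Average of the per-aspect scores (the mean_aspects rule)."""
    return mean(resp["aspects"].values())


def overall(resp):
    """Single overall score (the overall rule)."""
    return resp["overall"]


# Toy responses (hypothetical values, not taken from the dataset).
a = {"id": "a", "overall": 8,
     "aspects": {"helpfulness": 3, "honesty": 4,
                 "instruction_following": 3, "truthfulness": 4}}
b = {"id": "b", "overall": 7,
     "aspects": {"helpfulness": 4, "honesty": 4,
                 "instruction_following": 4, "truthfulness": 4}}

chosen_by_mean, _ = order_pair(a, b, mean_aspects)
chosen_by_overall, _ = order_pair(a, b, overall)
print(chosen_by_mean["id"], chosen_by_overall["id"])
```

Here the per-aspect mean prefers response `b` (4 versus 3.5) while the overall score prefers `a` (8 versus 7), which is the kind of pair that ends up with swapped chosen and rejected fields between the two splits.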

Provider:

maas

Created:

2025-05-28

Collection summary

Dataset introduction