MagpieLM-DPO-Data-v0.1

Source: ModelScope community (updated 2026-01-06, indexed 2025-01-18)

Download link: https://modelscope.cn/datasets/Magpie-Align/MagpieLM-DPO-Data-v0.1
![Magpie](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/FWWILXrAGNwWr52aghV0S.png)

Project Web: [https://magpie-align.github.io/](https://magpie-align.github.io/)

Arxiv Technical Report: [https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464)

Code: [https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie)

## 🧐 Dataset Details

The Magpie Team generated this dataset for direct preference optimization (DPO). It was used to train [Magpie-Align/MagpieLM-4B-Chat-v0.1](https://huggingface.co/Magpie-Align/MagpieLM-4B-Chat-v0.1).

This dataset is a combination of two datasets:

- Half of the dataset (100K) is from [Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1/tree/main).
- The other half (100K) uses the instructions from [Magpie-Align/Magpie-Air-DPO-100K-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-DPO-100K-v0.1), with responses generated by [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it) 5 times for each instruction at a temperature of 0.8. We then annotated RM scores using [RLHFlow/ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1), labeling the response with the highest RM score as the chosen response and the one with the lowest RM score as the rejected response.

Why Magpie 💜 Gemma-2-9B? Take a look at our latest paper: [Stronger Models are NOT Stronger Teachers for Instruction Tuning](https://huggingface.co/papers/2411.07133). We found that stronger models are not always stronger teachers for instruction tuning!

**License**: Please follow [Meta Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) and [Gemma License](https://www.kaggle.com/models/google/gemma/license/).
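The chosen/rejected labeling described under Dataset Details can be sketched as follows. This is a minimal illustration, not the Magpie Team's actual pipeline: the `PreferencePair` class, the `build_preference_pair` function, the example scores, and the handling of ties are all assumptions introduced here. The sketch takes the sampled responses and their reward-model scores as plain Python lists and does not call gemma-2-9b-it or ArmoRM itself.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training example (hypothetical container, not the dataset schema)."""
    instruction: str
    chosen: str
    rejected: str


def build_preference_pair(
    instruction: str,
    responses: Sequence[str],
    rm_scores: Sequence[float],
) -> Optional[PreferencePair]:
    """Pick the highest-scored response as chosen and the lowest as rejected.

    Assumption: if every response receives the same score, no meaningful
    preference exists, so None is returned. The source does not say how
    ties are handled.
    """
    if len(responses) != len(rm_scores):
        raise ValueError("responses and rm_scores must have the same length")
    if len(responses) < 2:
        raise ValueError("at least two responses are needed to form a pair")

    indices = range(len(responses))
    best = max(indices, key=lambda i: rm_scores[i])
    worst = min(indices, key=lambda i: rm_scores[i])

    if rm_scores[best] == rm_scores[worst]:
        return None  # all scores tied (assumed behavior)

    return PreferencePair(instruction, responses[best], responses[worst])


# Five sampled responses per instruction, as in the dataset description.
# The scores below are made up for illustration.
pair = build_preference_pair(
    "Explain what DPO is.",
    ["resp A", "resp B", "resp C", "resp D", "resp E"],
    [0.12, 0.31, 0.08, 0.27, 0.19],
)
print(pair)
```

Only the best and worst of the five samples are kept, so the three middle responses are discarded for each instruction.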
## 📚 Citation

If you find the model, data, or code useful, please cite our paper:

```
@article{xu2024magpie,
  title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
  author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
  year={2024},
  eprint={2406.08464},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Please also cite the reward model for creating preference datasets:

ArmoRM paper:

```
@article{wang2024interpretable,
  title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
  author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
  journal={arXiv preprint arXiv:2406.12845},
  year={2024}
}

@article{xu2024stronger,
  title={Stronger Models are NOT Stronger Teachers for Instruction Tuning},
  author={Xu, Zhangchen and Jiang, Fengqing and Niu, Luyao and Lin, Bill Yuchen and Poovendran, Radha},
  journal={arXiv preprint arXiv:2411.07133},
  year={2024}
}
```

**Contact**

Questions? Contact:

- [Zhangchen Xu](https://zhangchenxu.com/) [zxu9 at uw dot edu], and
- [Bill Yuchen Lin](https://yuchenlin.xyz/) [yuchenlin1995 at gmail dot com]

Provider: maas

Created: 2025-01-15