Magpie-Llama-3.1-Pro-DPO-100K-v0.1
收藏魔搭社区2026-01-02 更新2025-01-18 收录
下载链接:
https://modelscope.cn/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1
下载链接
链接失效反馈官方服务:
资源简介:

Project Web: [https://magpie-align.github.io/](https://magpie-align.github.io/)
Arxiv Technical Report: [https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464)
Codes: [https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie)
## Abstract
<details><summary>Click Here</summary>
High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
</details><be>
## Dataset Details
This dataset is generated by [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) for direct preference optimization.
To create the dataset, we first selected 100K high-quality Magpie instructions with diverse task categories, then generated responses using [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) 5 times for each instruction, using a temperature of 0.8. We then annotated RM scores using RLHFlow/ArmoRM-Llama3-8B-v0.1, labeling the response with the highest RM score as the chosen response, and the one with the lowest RM score as the rejected response.
**License**: Please follow [Meta Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE).
## 📚 Citation
If you find the model, data, or code useful, please cite our paper:
```
@article{xu2024magpie,
title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
year={2024},
eprint={2406.08464},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Please also cite the reward model for creating preference datasets:
ArmoRM paper:
```
@article{wang2024interpretable,
title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
journal={arXiv preprint arXiv:2406.12845},
year={2024}
}
```
**Questions?** Please contact [Zhangchen](https://zhangchenxu.com/) by email.
|Model Name | Dataset | Type | Description |
|-------------|:-------|:-------|:-------|
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-1M](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-1M-v0.1) | SFT | 1M Raw conversations built with Meta Llama 3.1 70B.
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) | SFT | Apply a filter and select 300K high quality conversations.
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-MT-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-v0.1) | SFT | Select 300K high quality questions and extend to multi-turn conversations.
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-DPO-100K](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1) | DPO | DPO dataset via Best-of-N sampling and rewards.

项目主页:[https://magpie-align.github.io/](https://magpie-align.github.io/)
arXiv技术报告:[https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464)
代码仓库:[https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie)
## 摘要
<details><summary>点击展开</summary>
高质量的指令数据对于对齐大语言模型(Large Language Model,LLM)至关重要。尽管部分模型(如Llama-3-Instruct)开放了模型权重,但其对齐数据仍处于私有状态,这阻碍了人工智能的民主化进程。现有的开源数据构建方法面临人工标注成本高昂、提示范围预先限定且有限的问题,难以实现有效扩展,进而可能限制了公开对齐数据集的多样性与质量。我们能否直接从已对齐的大语言模型中提取指令数据,从而规模化合成高质量的指令数据?为此,我们提出了一种用于生成大规模对齐数据的自合成方法,命名为Magpie。我们的核心观察是:得益于自回归特性,当我们仅输入用户消息预留位置之前的左侧模板时,像Llama-3-Instruct这类已对齐的大语言模型能够生成用户查询。我们利用该方法对Llama-3-Instruct进行提示,生成了400万条指令及其对应的响应。我们对提取得到的数据进行了全面分析,并筛选出30万条高质量样本。为了将Magpie数据集与其他公开指令数据集进行对比,我们使用每个数据集分别对Llama-3-8B-Base进行微调,并评估微调后模型的性能。实验结果表明,在部分任务中,使用Magpie数据集微调得到的模型性能可与官方的Llama-3-8B-Instruct相媲美——尽管后者通过监督微调(Supervised Fine-Tuning, SFT)和后续的反馈学习使用了1000万条数据进行增强。我们还证明,仅使用Magpie数据集进行监督微调,其性能可超越此前同时用于监督微调与偏好优化的公开数据集,例如结合UltraFeedback的直接偏好优化(Direct Preference Optimization, DPO)。这一优势在AlpacaEval、ArenaHard与WildBench等对齐基准测试中均有体现。
</details><br>
## 数据集详情
本数据集由Llama 3.1 70B Instruct(https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)生成,用于直接偏好优化任务。
为构建该数据集,我们首先选取了10万条覆盖多样化任务类别的高质量Magpie指令,随后针对每条指令使用Llama 3.1 70B Instruct生成5条响应,温度系数设置为0.8。接着我们使用RLHFlow/ArmoRM-Llama3-8B-v0.1标注奖励模型(Reward Model, RM)得分,将RM得分最高的响应标记为选中响应(chosen response),得分最低的标记为被拒绝响应(rejected response)。
**许可协议**:请遵循Meta Llama 3.1社区许可协议(https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)。
## 📚 引用
若您认为本模型、数据集或代码对您的研究有所帮助,请引用我们的论文:
@article{xu2024magpie,
title={Magpie: 基于空提示对齐大语言模型的对齐数据自合成},
author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
year={2024},
eprint={2406.08464},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
同时请引用用于构建偏好数据集的奖励模型相关论文:
ArmoRM论文:
@article{wang2024interpretable,
title={基于多目标奖励建模与混合专家模型的可解释偏好},
author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
journal={arXiv preprint arXiv:2406.12845},
year={2024}
}
**疑问咨询**:请通过邮件联系[张晨(Zhangchen)](https://zhangchenxu.com/)。
| 模型名称 | 数据集 | 类型 | 描述 |
|-------------|:-------|:-------|:-------|
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-1M-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-1M-v0.1) | 监督微调(SFT) | 由Meta Llama 3.1 70B生成的100万条原始对话数据。
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) | 监督微调(SFT) | 经过筛选后选取的30万条高质量对话数据。
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-MT-300K-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-v0.1) | 监督微调(SFT) | 选取30万条高质量问题并扩展为多轮对话。
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-DPO-100K-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1) | 直接偏好优化(DPO) | 通过最佳N次采样与奖励标注构建的DPO数据集。
提供机构:
maas
创建时间:
2025-01-15



