Magpie-Llama-3.1-Pro-MT-300K-Filtered

Source: ModelScope. Updated 2025-11-07; indexed 2025-01-18.
Download link:
https://modelscope.cn/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-Filtered
Description:
![Magpie](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/FWWILXrAGNwWr52aghV0S.png)

Project Web: [https://magpie-align.github.io/](https://magpie-align.github.io/)

Arxiv Technical Report: [https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464)

Code: [https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie)

## Abstract

<details><summary>Click Here</summary>

High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.

</details><br>

## Dataset Details

This dataset is generated by [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) using [Magpie](https://huggingface.co/Magpie-Align). Please refer to our [paper](https://arxiv.org/abs/2406.08464) and [codebase](https://github.com/magpie-align/magpie) for implementation details.

**License**: Please follow [Meta Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE).

### Available Labels

- **Input Length**: The total number of characters in the instructions.
- **Output Length**: The total number of characters in the responses.
- **Task Category**: The specific category of the instructions.
- **Input Quality**: The clarity, specificity, and coherence of the instructions, rated as 'very poor', 'poor', 'average', 'good', or 'excellent'.
- **Input Difficulty**: The level of knowledge required to address the task described in the instruction, rated as 'very easy', 'easy', 'medium', 'hard', or 'very hard'.
- **Minimum Neighbor Distance**: The embedding distance to the nearest neighbor within the dataset. It can be used for filtering out repetitive or similar instances.
- **Safety**: Safety tags marked by [meta-llama/Meta-Llama-Guard-2-8B](https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B).
- **Reward**: The output of the reward model given the specific instruction-response pair.
- **Language**: The language of the instruction.

## Filter Setups

We note that [Magpie-Llama-3.1-Pro-MT-500K](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-MT-500K-v0.1) contains a large number of chain-of-thought responses, which are not necessary. Therefore, in this dataset, we reduce the amount of data containing `## Step 1`.

To create this multi-turn dataset, we first filtered [Magpie-Llama-3.1-Pro-1M](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-1M-v0.1) with the following setups:

- **Input Quality**: >= good
- **Instruction Reward**: >= -10
- Remove repeated and incomplete instructions (e.g., those ending with `:`)
- Choose instructions with fewer than 5 `\n` characters, except for coding & debugging
- Choose the 500K data with the longest responses

=> [Magpie-Llama-3.1-Pro-500K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-500K-Filtered)

We then extend them to multi-turn conversations:

=> [Magpie-Llama-3.1-Pro-MT-500K](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-MT-500K-v0.1)

We finally get a 300K subset with the following setups:

- Removing incomplete second-turn instructions (e.g., those ending with `:`)
- Reducing the amount of data containing `## Step 1` in responses

## Dataset Navigation 🧭

| Model Name | Dataset | Type | Description |
|-------------|:-------|:-------|:-------|
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-1M](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-1M-v0.1) | SFT | 1M raw conversations built with Meta Llama 3.1 70B. |
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) | SFT | Apply a filter and select 300K high-quality conversations. |
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-500K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-500K-Filtered) | SFT | Apply a filter and select 500K high-quality conversations. |
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-MT-500K](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-MT-500K-v0.1) | SFT | Extend Magpie-Llama-3.1-Pro-500K-Filtered to multi-turn. |
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-MT-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-Filtered) | SFT | Select 300K high-quality multi-turn conversations from Magpie-Llama-3.1-Pro-MT-500K. |
| [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) | [Magpie-Llama-3.1-Pro-DPO-100K](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1) | DPO | DPO dataset via Best-of-N sampling and rewards. |
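
The filter setups described in this card can be illustrated with a minimal Python sketch. It is not the authors' released pipeline. The field names (`instruction`, `response`, `input_quality`, `instruct_reward`, `task_category`, `repeat_count`, `conversations` with `from`/`value` keys), the category string `"Coding & Debugging"`, and the `max_step_rows` cap are assumptions made for illustration; in particular, the card does not say how the amount of `## Step 1` data was reduced, so the cap below is only one possible reading.

```python
from typing import Dict, List

# Assumed label values for "Input Quality >= good".
GOOD_QUALITY = {"good", "excellent"}


def keep_instruction(row: Dict) -> bool:
    """Return True if a single-turn record passes the first-stage filters."""
    instruction = row["instruction"]
    if row["input_quality"] not in GOOD_QUALITY:   # Input Quality >= good
        return False
    if row["instruct_reward"] < -10:               # Instruction Reward >= -10
        return False
    if row.get("repeat_count", 0) > 0:             # repeated instruction (assumed field)
        return False
    if instruction.rstrip().endswith(":"):         # incomplete instruction
        return False
    # Fewer than 5 newlines, except for coding & debugging tasks.
    if row["task_category"] != "Coding & Debugging" and instruction.count("\n") >= 5:
        return False
    return True


def first_stage(rows: List[Dict], top_k: int) -> List[Dict]:
    """Filter records, then keep the top_k with the longest responses."""
    kept = [r for r in rows if keep_instruction(r)]
    kept.sort(key=lambda r: len(r["response"]), reverse=True)
    return kept[:top_k]


def second_stage(rows: List[Dict], max_step_rows: int) -> List[Dict]:
    """Drop incomplete second-turn instructions and cap '## Step 1' records.

    max_step_rows is a hypothetical knob; the card only says the amount
    of such data was reduced.
    """
    kept, step_rows = [], 0
    for row in rows:
        turns = row["conversations"]
        human = [t["value"] for t in turns if t["from"] == "human"]
        if len(human) > 1 and human[1].rstrip().endswith(":"):
            continue
        has_steps = any(
            "## Step 1" in t["value"] for t in turns if t["from"] == "gpt"
        )
        if has_steps:
            if step_rows >= max_step_rows:
                continue
            step_rows += 1
        kept.append(row)
    return kept
```

Under these assumptions, `first_stage(rows, 500_000)` would correspond to the 500K filtered set, and `second_stage` would be applied after the multi-turn extension.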

Provider: maas
Created: 2025-01-15