Magpie-Air-MT-300K-v0.1

Name: Magpie-Air-MT-300K-v0.1
Creator: maas
Published: 2025-11-19 17:02:28
License: 暂无描述

魔搭社区2025-11-19 更新2025-01-18 收录

下载链接：

https://modelscope.cn/datasets/Magpie-Align/Magpie-Air-MT-300K-v0.1

下载链接

链接失效反馈

官方服务：

资源简介：

![Magpie](magpie_logo.png) Project Web: [https://magpie-align.github.io/](https://magpie-align.github.io/) Arxiv Technical Report: [https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464) Codes: [https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie) ## Abstract <details><summary>Click Here</summary> High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench. </details><be> ## Dataset Details This dataset is generated by [Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) using [Magpie](https://huggingface.co/Magpie-Align). Please refer to our [paper](https://arxiv.org/abs/2406.08464) and [codebase](https://github.com/magpie-align/magpie) for implementation details. This is the filtered data with a multi-turn extension. Please see below for the filter design. Please do not use **Magpie-Air-300K-Filtered** and **Magpie-Air-MT-300K** to fine-tune the model simultaneously as they are largely the same for the first turn! You can find the model fine-tuned using this dataset [here](https://huggingface.co/Magpie-Align/Llama-3-8B-Magpie-Air-MT-SFT-v0.1). ## Filter Setups - **Input Quality**: >= good - **Input Difficulty**: >= medium - **Reward difference**: >= 0 - Remove repetition and incomplete instructions (e.g., end with :) - Choose 300K data with the longest responses ## Dataset Navigation 🧭 |Model Name | Dataset | Type | Description | |-------------|:-------|:-------|:-------| | [Llama 3 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | [Magpie-Pro-1M](https://huggingface.co/datasets/Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1) | SFT | 1M Raw conversations built with Meta Llama 3 70B. | [Llama 3 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | [Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) | SFT | Apply a filter and select 300K high quality conversations. | [Llama 3 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | [Magpie-Pro-MT-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-MT-300K-v0.1) | SFT | Select 300K difficult questions and extend to multi-turn conversations. | [Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [Magpie-Air-3M](https://huggingface.co/datasets/Magpie-Align/Llama-3-Magpie-Air-3M-v0.1) | SFT | 3M Raw conversations built with Meta Llama 3 8B. | [Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [Magpie-Air-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-300K-Filtered) | SFT | Apply a filter and select 300K high quality data. | [Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [Magpie-Air-MT-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-MT-300K-v0.1) | SFT | Select 300K difficult questions and extend to multi-turn conversations.

![Magpie](magpie_logo.png) 项目官网：[https://magpie-align.github.io/](https://magpie-align.github.io/) Arxiv技术报告：[https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464) 代码仓库：[https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie) ## 摘要 <details><summary>点击展开</summary> 高质量的指令微调数据对于对齐大语言模型（Large Language Model，LLM）至关重要。尽管部分模型（如Llama-3-Instruct）已开源权重，但其对齐数据仍未公开，这阻碍了人工智能的民主化进程。高昂的人力成本与预设的提示范围局限，使得现有的开源数据构建方法难以有效扩展，进而可能限制了公开对齐数据集的多样性与质量。能否通过直接从已对齐的大语言模型中提取数据，来大规模生成高质量的指令微调数据？为此我们提出了一种名为Magpie的大规模对齐数据自合成方法。我们的核心观察在于：得益于自回归特性，像Llama-3-Instruct这类已对齐的大语言模型，仅需输入至用户消息预留位置的左侧模板，即可生成用户查询内容。我们借助该方法对Llama-3-Instruct进行提示，生成了400万条指令及其对应的响应内容。我们对提取得到的数据进行了全面分析，并从中选取了30万条高质量样本。为了将Magpie数据集与其他公开指令数据集进行对比，我们分别使用各数据集对Llama-3-8B-Base进行微调，并评估微调后模型的性能。实验结果表明，在部分任务中，使用Magpie数据集微调得到的模型性能可与官方Llama-3-8B-Instruct相媲美——尽管后者通过监督微调（Supervised Fine-Tuning，SFT）与后续反馈学习，使用了1000万条数据进行增强。我们还证实，仅使用Magpie数据集进行监督微调，其性能可超越此前同时用于监督微调与偏好优化的公开数据集（如结合UltraFeedback的直接偏好优化数据集）。这一优势在AlpacaEval、ArenaHard与WildBench等对齐基准测试中表现显著。 </details> ## 数据集详情本数据集由[Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)借助[Magpie](https://huggingface.co/Magpie-Align)生成。如需了解实现细节，请参阅我们的[论文](https://arxiv.org/abs/2406.08464)与[代码仓库](https://github.com/magpie-align/magpie)。本数据集为经过筛选并支持多轮对话扩展的版本。筛选设计详见下文。请勿同时使用**Magpie-Air-300K-Filtered**与**Magpie-Air-MT-300K**进行模型微调，二者的首轮对话内容高度重合！您可通过[此处](https://huggingface.co/Magpie-Align/Llama-3-8B-Magpie-Air-MT-SFT-v0.1)获取使用本数据集微调得到的模型。 ## 筛选设置 - **输入质量**：≥ 良好 - **输入难度**：≥ 中等 - **奖励分差**：≥ 0 - 移除重复与不完整的指令（例如以冒号结尾的内容） - 选取响应长度最长的30万条数据 ## 数据集导航 🧭 | 模型名称 | 数据集 | 类型 | 描述 | | :------- | :------- | :------- | :------- | | [Llama 3 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | [Magpie-Pro-1M](https://huggingface.co/datasets/Magpie-Align/Llama-3-Magpie-Pro-1M-v0.1) | 监督微调（SFT） | 基于Meta Llama 3 70B构建的100万条原始对话数据 | | [Llama 3 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | [Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) | 监督微调（SFT） | 经过筛选并选取30万条高质量对话数据 | | [Llama 3 70B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) | [Magpie-Pro-MT-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-MT-300K-v0.1) | 监督微调（SFT） | 选取30万条高难度问题并扩展为多轮对话数据 | | [Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [Magpie-Air-3M](https://huggingface.co/datasets/Magpie-Align/Llama-3-Magpie-Air-3M-v0.1) | 监督微调（SFT） | 基于Meta Llama 3 8B构建的300万条原始对话数据 | | [Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [Magpie-Air-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-300K-Filtered) | 监督微调（SFT） | 经过筛选并选取30万条高质量数据 | | [Llama 3 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [Magpie-Air-MT-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Air-MT-300K-v0.1) | 监督微调（SFT） | 选取30万条高难度问题并扩展为多轮对话数据 |

提供机构：

maas

创建时间：

2025-01-15

5,000+

优质数据集

54 个

任务类型

进入经典数据集