five

Magpie-Qwen2.5-Math-Pro-300K-v0.1

收藏
魔搭社区2025-12-03 更新2025-01-18 收录
下载链接:
https://modelscope.cn/datasets/Magpie-Align/Magpie-Qwen2.5-Math-Pro-300K-v0.1
下载链接
链接失效反馈
官方服务:
资源简介:
![Magpie](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/FWWILXrAGNwWr52aghV0S.png) Project Web: [https://magpie-align.github.io/](https://magpie-align.github.io/) Arxiv Technical Report: [https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464) Codes: [https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie) ## Abstract <details><summary>Click Here</summary> High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench. </details><be> ## Dataset Details This dataset is generated by [Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) using [Magpie](https://huggingface.co/Magpie-Align). Please refer to our [paper](https://arxiv.org/abs/2406.08464) and [codebase](https://github.com/magpie-align/magpie) for implementation details. ### Available Labels - **Input Length**: The total number of characters in the instructions. - **Output Length**: The total number of characters in the responses. - **Task Category**: The specific category of the instructions. - **Input Quality**: The clarity, specificity, and coherence of the instructions, rated as 'very poor', 'poor', 'average', 'good', and 'excellent'. - **Input Difficulty**: The level of knowledge required to address the task described in the instruction, rated as 'very easy', 'easy', 'medium', 'hard', or 'very hard'. - **Minimum Neighbor Distance**: The embedding distance to the nearest neighbor within the dataset. It can be used for filtering out repetitive or similar instances. - **Safety**: Safety tags marked by [meta-llama/Meta-Llama-Guard-2-8B](https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B) - **Reward**: The output of the reward model given the specific instruction-response pair. - **Language**: The language of the instruction. ## Dataset Navigation 🧭 |Model Name | Dataset | Type | Description | |-------------|:-------|:-------|:-------| | [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | [Magpie-Qwen2.5-Pro-1M](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1) | SFT | 1M Raw conversations built with Qwen2.5 72B Instruct. | [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | [Magpie-Qwen2.5-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Pro-300K-Filtered) | SFT | Apply a filter and select 300K high quality conversations. | [Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) | [Magpie-Qwen2.5-Math-Pro-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Math-Pro-300K-v0.1) | SFT | 300K Raw conversations built with Qwen2.5 Math 72B Instruct. | [Qwen2.5 Coder 32B Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) | [Magpie-Qwen2.5-Coder-Pro-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Coder-Pro-300K-v0.1) | SFT | 300K Raw conversations built with Qwen2.5 Coder 32B Instruct. | [Qwen2 72B Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | [Magpie-Qwen2-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-300K-Filtered) | SFT | Apply a filter and select 300K high quality conversations. | [Qwen2 72B Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | [Magpie-Qwen2-Pro-200K-Chinese](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese) | SFT | Apply a filter and select 200K high quality Chinese conversations. | [Qwen2 72B Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | [Magpie-Qwen2-Pro-200K-English](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-English) | SFT | Apply a filter and select 200K high quality English conversations.

![Magpie](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/FWWILXrAGNwWr52aghV0S.png) 项目官网:[https://magpie-align.github.io/](https://magpie-align.github.io/) arXiv技术报告:[https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464) 代码仓库:[https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie) ## 摘要 <details><summary>点击展开</summary> 高质量指令数据对于对齐大语言模型(Large Language Model,LLM)至关重要。尽管部分模型(如Llama-3-Instruct)已开放权重,但其对齐数据仍属私有,这阻碍了人工智能的民主化进程。现有的开源数据构建方法面临人工成本高昂、提示范围有限且预定义的问题,难以有效扩展,这可能限制了公开对齐数据集的多样性与质量。我们能否直接从已对齐的大语言模型中提取数据,以大规模合成高质量的指令数据? 我们提出了一种名为Magpie的大规模对齐数据自生成方法。我们的核心发现是,得益于自回归特性,仅输入至用户消息预留位置的左侧模板时,Llama-3-Instruct等已对齐的大语言模型即可生成用户查询。我们利用该方法对Llama-3-Instruct进行提示,生成了400万条指令及其对应的回复。我们对提取的数据进行了全面分析,并筛选出30万个高质量实例。 为了将Magpie数据集与其他公开指令数据集进行对比,我们使用每个数据集对Llama-3-8B-Base进行微调,并评估微调后模型的性能。我们的结果表明,在部分任务中,使用Magpie数据集微调的模型性能可与官方的Llama-3-8B-Instruct相媲美——尽管后者通过监督微调(Supervised Fine-Tuning,SFT)与后续反馈学习使用了1000万条数据进行增强。我们还证明,仅使用Magpie数据集进行SFT的效果,可超越此前用于SFT与偏好优化(如结合UltraFeedback的直接偏好优化)的公开数据集。这一优势在AlpacaEval、ArenaHard与WildBench等对齐基准测试中尤为显著。 </details> <be> ## 数据集详情 本数据集由[Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct)通过[Magpie](https://huggingface.co/Magpie-Align)生成,请参阅我们的[论文](https://arxiv.org/abs/2406.08464)与[代码库](https://github.com/magpie-align/magpie)了解实现细节。 ### 可用标签 - **输入长度**:指令的总字符数。 - **输出长度**:回复的总字符数。 - **任务类别**:指令所属的具体类别。 - **输入质量**:指令的清晰度、特异性与连贯性,评级分为「极差」「较差」「中等」「良好」「优秀」。 - **输入难度**:完成指令描述任务所需的知识水平,评级分为「极简单」「简单」「中等」「困难」「极困难」。 - **最小近邻距离**:数据集中与当前实例最近邻的嵌入距离,可用于过滤重复或相似实例。 - **安全性**:由[meta-llama/Meta-Llama-Guard-2-8B](https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B)标记的安全标签。 - **奖励分数**:针对特定指令-回复对的奖励模型输出结果。 - **语言**:指令所使用的语言。 ## 数据集导航 🧭 | 模型名称 | 数据集 | 类型 | 描述 | |:-------|:-------|:-------|:-------| | [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | [Magpie-Qwen2.5-Pro-1M](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1) | SFT | 基于Qwen2.5 72B Instruct构建的100万条原始对话数据。 | [Qwen2.5 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | [Magpie-Qwen2.5-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Pro-300K-Filtered) | SFT | 经过筛选后选取的30万个高质量对话数据。 | [Qwen2.5 Math 72B Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct) | [Magpie-Qwen2.5-Math-Pro-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Math-Pro-300K-v0.1) | SFT | 基于Qwen2.5 Math 72B Instruct构建的30万个原始对话数据。 | [Qwen2.5 Coder 32B Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) | [Magpie-Qwen2.5-Coder-Pro-300K](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Coder-Pro-300K-v0.1) | SFT | 基于Qwen2.5 Coder 32B Instruct构建的30万个原始对话数据。 | [Qwen2 72B Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | [Magpie-Qwen2-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-300K-Filtered) | SFT | 经过筛选后选取的30万个高质量对话数据。 | [Qwen2 72B Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | [Magpie-Qwen2-Pro-200K-Chinese](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese) | SFT | 经过筛选后选取的20万个高质量中文对话数据。 | [Qwen2 72B Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | [Magpie-Qwen2-Pro-200K-English](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-English) | SFT | 经过筛选后选取的20万个高质量英文对话数据。
提供机构:
maas
创建时间:
2025-01-15
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作