Magpie-Llama-3.1-70B-Instruct-Filtered-1M

Name: Magpie-Llama-3.1-70B-Instruct-Filtered-1M
Creator: HiTZ zentroa
Published: 2025-05-23 23:16:02
License: 暂无描述

Hugging Face2025-05-23 更新2025-05-24 收录

下载链接：

https://huggingface.co/datasets/HiTZ/Magpie-Llama-3.1-70B-Instruct-Filtered-1M

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集使用meta-llama/Llama-3.1-70B-Instruct模型和MAGPIE代码库生成。数据集通过特定的过滤标准进行了筛选，包括输入质量、指导奖励以及文本中的重复性检查。系统提示包括通用提示以及针对编码、数学、算术和机器翻译任务的特定提示。

This dataset was generated using the meta-llama/Llama-3.1-70B-Instruct model and the MAGPIE codebase. It was filtered through specific screening criteria, including input quality, instruction reward, and repetition checks within the text. The system prompts include general prompts as well as specialized prompts for coding, mathematics, arithmetic, and machine translation tasks.

提供机构：

HiTZ zentroa

创建时间：

2025-05-23

原始信息汇总

数据集概述

基本信息

许可证: Apache-2.0
语言: 英语 (en)
标签: 合成数据 (synthetic)

数据集生成

生成工具: 使用 meta-llama/Llama-3.1-70B-Instruct 和 MAGPIE代码库生成。
未过滤数据集: 可在此处获取 /HiTZ/Magpie-Llama-3.1-70B-Instruct-Unfiltered。

过滤标准

python min_repetition = 100

def test_no_repetition(text: str): # Count the frequency of each word in the text word_count = Counter(text.split()) # Check if any word appears more than min_repetition times return all(count <= min_repetition for count in word_count.values())

def high_quality_filter(example): return ( example["input_quality"] in ["good", "excellent", "average"] and example["instruct_reward"] > -10 and not example["instruction"].endswith(":") and ( example["min_similar_conversation_id"] is None or example["conversation_id"] == example["min_similar_conversation_id"] ) and test_no_repetition(example["response"]) )

系统提示

通用提示

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023 Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

代码提示

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an AI assistant designed to provide helpful, step-by-step guidance on coding problems. The user will ask you a wide range of coding questions. Your purpose is to assist users in understanding coding concepts, working through code, and arriving at the correct solutions.<|eot_id|><|start_header_id|>user<|end_header_id|>

数学提示

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an AI assistant designed to provide helpful, step-by-step guidance on solving math problems. The user will ask you a wide range of complex mathematical questions. Your purpose is to assist users in understanding mathematical concepts, working through equations, and arriving at the correct solutions.<|eot_id|><|start_header_id|>user<|end_header_id|>

算术提示

You are an AI assistant designed to provide helpful, step-by-step guidance on solving complex arithmetic operations. The user will provide you with an arithmetic operation or a concatenation of multiple arithmetic operations. Your purpose is to assist users in computing the results of the arithmetic operation exlaining the process step by step.<|eot_id|><|start_header_id|>user<|end_header_id|>

机器翻译提示

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an AI assistant specifically designed to provide accurate and contextually appropriate translations. Users will ask you to translate a large text between various languages. Your purpose is to translate the text, maintaining the original context and nuances.<|eot_id|><|start_header_id|>user<|end_header_id|>

搜集汇总

数据集介绍

构建方式

该数据集通过先进的合成数据生成技术构建，基于meta-llama/Llama-3.1-70B-Instruct模型与MAGPIE代码库生成。构建过程中采用了多阶段过滤机制，包括输入质量筛选（仅保留good、excellent、average等级别）、指令奖励阈值（instruct_reward > -10）、指令格式校验（排除以冒号结尾的指令）以及响应重复性检测（确保无单词重复超过100次）。同时通过对话相似性比对保证数据唯一性，最终形成精炼的高质量数据集。

特点

数据集展现出显著的多领域覆盖特性，包含通用对话、编程指导、数学问题求解、算术运算及机器翻译五大专项领域。每个领域配备定制化的系统提示模板，确保生成内容符合专业语境要求。其核心优势在于严格的质量控制体系，通过量化指标与规则过滤相结合的方式，在保持数据多样性的同时有效剔除低质量样本，为AI训练提供了语义丰富且逻辑连贯的指令-响应对。

使用方法

该数据集适用于大规模语言模型的指令微调与性能增强，用户可通过HuggingFace平台直接加载过滤后的1M高质量样本。使用时应根据具体应用场景选择相应领域数据（如coding、math等），并注意系统提示模板的完整性保留。对于需要更高数据量的研究，可参考未过滤版本数据集进行扩展，但需自行实施额外质量控制。建议在模型训练前进行领域分布分析，以优化不同任务类型的数据采样比例。

背景与挑战

背景概述

Magpie-Llama-3.1-70B-Instruct-Filtered-1M数据集是由HiTZ研究团队基于Meta公司发布的Llama-3.1-70B-Instruct模型，利用MAGPIE代码框架生成的合成数据集。该数据集聚焦于提升大规模语言模型在多样化任务中的指令遵循能力，涵盖通用对话、代码生成、数学问题求解及机器翻译等多个领域。通过严格的质量过滤机制，该数据集为语言模型的微调与评估提供了高质量的语料资源，对推动开放域对话系统的研究具有重要意义。

当前挑战

该数据集面临的核心挑战主要体现在两个方面：在领域问题层面，如何确保生成内容在多样性、准确性与实用性之间的平衡，特别是在代码生成和数学推理等专业领域，需要克服模型幻觉和逻辑一致性等难题；在构建过程层面，设计有效的过滤标准以剔除低质量、重复或无关内容，同时保留语义丰富且符合指令要求的样本，是数据集质量控制的关键挑战。此外，系统提示词的设计需精准匹配不同任务类型的需求，这对数据集的适用性提出了更高要求。

常用场景

经典使用场景

在自然语言处理领域，Magpie-Llama-3.1-70B-Instruct-Filtered-1M数据集因其高质量的合成数据特性，常被用于训练和评估大规模语言模型。该数据集通过Llama-3.1-70B-Instruct模型生成，并经过严格的过滤标准筛选，确保了数据的多样性和准确性。研究人员通常利用该数据集来优化模型的指令遵循能力、文本生成质量以及多轮对话的连贯性。

解决学术问题

该数据集有效解决了语言模型训练中数据质量参差不齐的学术难题。通过设定输入质量、指令奖励和重复性检测等多重过滤标准，显著提升了合成数据的可靠性。其意义在于为学术界提供了一个标准化的大规模高质量指令数据集，推动了对话系统、代码生成和数学推理等领域的研究进展，尤其在模型泛化能力和少样本学习方面具有重要影响。

衍生相关工作

围绕该数据集衍生的经典工作主要集中在三个方面：基于过滤标准的自动化数据清洗框架开发、多模态指令数据的扩展研究，以及低资源语言的任务适应技术。例如Magpie代码库的迭代优化推动了合成数据质量评估指标的标准化，而相关研究团队正在探索将过滤机制应用于视觉-语言联合建模领域。

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集