DistilQwen_1M

Name: DistilQwen_1M
Creator: maas
Published: 2026-01-06 16:33:37
License: No description available

Source: ModelScope community; updated 2026-01-06; indexed 2025-05-31

Download link:

https://modelscope.cn/datasets/PAI/DistilQwen_1M

Resource description:

# DistilQwen-1M: High-Quality Instruction-Tuning Dataset

## Overview

To empower community developers in enhancing the **instruction-following capabilities** of large language models (LLMs), we open-source **`DistilQwen-1M`**, a distilled subset of the training data used for the **DistilQwen model series**. Alongside its smaller counterpart (`DistilQwen-100K`), this dataset provides diverse, high-quality samples to improve model performance in key areas.

## Dataset Features

- **Scale**: **1 million** meticulously distilled entries.
- **Coverage**: Balanced mix of:
  - **Mathematics**
  - **Code generation & understanding**
  - **Knowledge-based QA**
  - **Instruction following**
  - **Creative generation**
- **Purpose**: Optimized for **instruction tuning**, helping models retain generalization while adapting to downstream tasks.

## Use Cases

- **Fine-tuning LLMs**: Mitigate *catastrophic forgetting* by combining with custom datasets.
- **Multi-task learning**: Improve coherence in mathematical reasoning, coding, and creative tasks.
- **Research**: Study distillation techniques or instruction-tuning efficacy.

## Reference

For more detailed information about the dataset construction process, we encourage you to refer to our paper:

- **DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models**
  Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
  [arXiv:2504.15027](https://arxiv.org/abs/2504.15027)

You can cite the paper using the following citation format:

```bibtex
@misc{wang2025distilqwen25industrialpracticestraining,
  title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models},
  author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang},
  year={2025},
  eprint={2504.15027},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.15027}
}
```
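As a rough illustration of the fine-tuning use case above (combining this dataset with custom data to mitigate catastrophic forgetting), the sketch below mixes a custom training set with a random sample of general-purpose records. It is not taken from the dataset card or the paper: the function name `mix_with_replay`, the `general_fraction` parameter, and the `instruction`/`output` record fields are all hypothetical, and the real schema of `DistilQwen-1M` should be checked before adapting it.

```python
import random


def mix_with_replay(custom, general, general_fraction=0.3, seed=0):
    """Return the custom records plus a random sample of general records.

    general_fraction is the target share of general-purpose records in the
    mixed set (an assumed knob, not a value recommended by the dataset authors).
    """
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_custom + n_general) = general_fraction for n_general.
    n_general = round(len(custom) * general_fraction / (1 - general_fraction))
    # Never ask for more general records than are available.
    n_general = min(n_general, len(general))
    mixed = list(custom) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)  # interleave both sources before training
    return mixed


# Hypothetical toy records standing in for a custom set and for DistilQwen-1M.
custom = [{"instruction": f"domain task {i}", "output": "..."} for i in range(70)]
general = [{"instruction": f"general task {i}", "output": "..."} for i in range(1000)]

mixed = mix_with_replay(custom, general, general_fraction=0.3)
print(len(mixed))
```

The mixing ratio is a tuning choice: a larger share of general-purpose data may preserve more general ability at the cost of slower adaptation to the custom task.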

Provider: maas

Created: 2025-05-27
