DistilQwen_1M
Source: ModelScope community. Updated: 2026-01-06. Indexed: 2025-05-31.
Download link:
https://modelscope.cn/datasets/PAI/DistilQwen_1M
# DistilQwen-1M: High-Quality Instruction-Tuning Dataset
## Overview
To empower community developers in enhancing the **instruction-following capabilities** of large language models (LLMs), we open-source **`DistilQwen-1M`**, a distilled subset of the training data used for the **DistilQwen model series**. Alongside its smaller counterpart (`DistilQwen-100K`), this dataset provides diverse, high-quality samples to improve model performance in key areas.
## Dataset Features
- **Scale**: **1 million** meticulously distilled entries.
- **Coverage**: Balanced mix of:
  - **Mathematics**
  - **Code generation & understanding**
  - **Knowledge-based QA**
  - **Instruction following**
  - **Creative generation**
- **Purpose**: Optimized for **instruction tuning**, helping models retain generalization while adapting to downstream tasks.
## Use Cases
- **Fine-tuning LLMs**: Mitigate *catastrophic forgetting* by combining with custom datasets.
- **Multi-task learning**: Improve coherence in mathematical reasoning, coding, and creative tasks.
- **Research**: Study distillation techniques or instruction-tuning efficacy.
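The fine-tuning use case above, combining general instruction data with a custom dataset, can be sketched as a simple mixing step. This is a minimal illustration, not the authors' recipe: the function name `mix_datasets`, the `general_ratio` parameter, and the `instruction`/`output` field names are all assumptions made for the example, and the toy lists stand in for real samples.

```python
import random
from typing import Dict, List

Sample = Dict[str, str]


def mix_datasets(
    custom: List[Sample],
    general: List[Sample],
    general_ratio: float = 0.5,
    seed: int = 0,
) -> List[Sample]:
    """Combine all custom samples with a random draw of general samples.

    general_ratio is the target fraction of general-purpose samples in the
    mixed set (hypothetical knob; tune it for your own task).
    """
    if not 0.0 <= general_ratio < 1.0:
        raise ValueError("general_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_custom + n_general) = general_ratio for n_general.
    n_general = round(len(custom) * general_ratio / (1.0 - general_ratio))
    # Never request more general samples than are available.
    n_general = min(n_general, len(general))
    mixed = list(custom) + rng.sample(general, n_general)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins; the field names are assumed, not taken from the dataset.
custom_data = [{"instruction": f"domain task {i}", "output": "..."} for i in range(30)]
general_data = [{"instruction": f"general task {i}", "output": "..."} for i in range(100)]

mixed_data = mix_datasets(custom_data, general_data, general_ratio=0.25)
```

Keeping every custom sample and subsampling only the general data keeps the downstream task fully represented while still exposing the model to general instructions during fine-tuning.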
## Reference
For details on how the dataset was constructed, see our paper:
- **DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models**
  Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
  [arXiv:2504.15027](https://arxiv.org/abs/2504.15027)
To cite the paper, use the following BibTeX entry:
```bibtex
@misc{wang2025distilqwen25industrialpracticestraining,
  title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models},
  author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang},
  year={2025},
  eprint={2504.15027},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.15027}
}
```
Provider: maas
Created: 2025-05-27



