alibaba-pai/DistilQwen_1M
收藏Hugging Face2025-05-24 更新2025-05-31 收录
下载链接:
https://hf-mirror.com/datasets/alibaba-pai/DistilQwen_1M
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
dataset_info:
features:
- name: instruct
dtype: string
- name: output
dtype: string
splits:
- name: train
num_bytes: 5352504933
num_examples: 2311632
download_size: 2773269443
dataset_size: 5352504933
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
# DistilQwen-1M: High-Quality Instruction-Tuning Dataset
## Overview
To empower community developers in enhancing the **instruction-following capabilities** of large language models (LLMs), we open-source **`DistilQwen-1M`**, a distilled subset of the training data used for the **DistilQwen model series**. Alongside its smaller counterpart (`DistilQwen-100K`), this dataset provides diverse, high-quality samples to improve model performance in key areas.
## Dataset Features
- **Scale**: **1 million** meticulously distilled entries.
- **Coverage**: Balanced mix of:
- **Mathematics**
- **Code generation & understanding**
- **Knowledge-based QA**
- **Instruction following**
- **Creative generation**
- **Purpose**: Optimized for **instruction tuning**, helping models retain generalization while adapting to downstream tasks.
## Use Cases
- **Fine-tuning LLMs**: Mitigate *catastrophic forgetting* by combining with custom datasets.
- **Multi-task learning**: Improve coherence in mathematical reasoning, coding, and creative tasks.
- **Research**: Study distillation techniques or instruction-tuning efficacy.
## Reference
For more detailed information about the dataset construction process, we encourage you to refer to our paper:
- **DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models**
Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
[arXiv:2504.15027](https://arxiv.org/abs/2504.15027)
You can cite the paper using the following citation format:
```bibtex
@misc{wang2025distilqwen25industrialpracticestraining,
title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models},
author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang},
year={2025},
eprint={2504.15027},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.15027}
}
```
DistilQwen-1M is a distilled subset of 1 million meticulously selected entries designed to enhance the instruction-following capabilities of large language models. It covers a balanced mix of areas including mathematics, code generation & understanding, knowledge-based QA, instruction following, and creative generation. The dataset is optimized for instruction tuning to help models maintain generalization while adapting to downstream tasks.
提供机构:
alibaba-pai



