alibaba-pai/DistilQwen_1M

Name: alibaba-pai/DistilQwen_1M
Creator: alibaba-pai
Published: 2025-05-24 09:42:34
License: 暂无描述

Hugging Face2025-05-24 更新2025-05-31 收录

下载链接：

https://hf-mirror.com/datasets/alibaba-pai/DistilQwen_1M

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 dataset_info: features: - name: instruct dtype: string - name: output dtype: string splits: - name: train num_bytes: 5352504933 num_examples: 2311632 download_size: 2773269443 dataset_size: 5352504933 configs: - config_name: default data_files: - split: train path: data/train-* --- # DistilQwen-1M: High-Quality Instruction-Tuning Dataset ## Overview To empower community developers in enhancing the **instruction-following capabilities** of large language models (LLMs), we open-source **`DistilQwen-1M`**, a distilled subset of the training data used for the **DistilQwen model series**. Alongside its smaller counterpart (`DistilQwen-100K`), this dataset provides diverse, high-quality samples to improve model performance in key areas. ## Dataset Features - **Scale**: **1 million** meticulously distilled entries. - **Coverage**: Balanced mix of: - **Mathematics** - **Code generation & understanding** - **Knowledge-based QA** - **Instruction following** - **Creative generation** - **Purpose**: Optimized for **instruction tuning**, helping models retain generalization while adapting to downstream tasks. ## Use Cases - **Fine-tuning LLMs**: Mitigate *catastrophic forgetting* by combining with custom datasets. - **Multi-task learning**: Improve coherence in mathematical reasoning, coding, and creative tasks. - **Research**: Study distillation techniques or instruction-tuning efficacy. ## Reference For more detailed information about the dataset construction process, we encourage you to refer to our paper: - **DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models** Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang [arXiv:2504.15027](https://arxiv.org/abs/2504.15027) You can cite the paper using the following citation format: ```bibtex @misc{wang2025distilqwen25industrialpracticestraining, title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models}, author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang}, year={2025}, eprint={2504.15027}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2504.15027} } ```

DistilQwen-1M is a distilled subset of 1 million meticulously selected entries designed to enhance the instruction-following capabilities of large language models. It covers a balanced mix of areas including mathematics, code generation & understanding, knowledge-based QA, instruction following, and creative generation. The dataset is optimized for instruction tuning to help models maintain generalization while adapting to downstream tasks.

提供机构：

alibaba-pai

5,000+

优质数据集

54 个

任务类型

进入经典数据集