five

alibaba-pai/DistilQwen_1M

收藏
Hugging Face2025-05-24 更新2025-05-31 收录
下载链接:
https://hf-mirror.com/datasets/alibaba-pai/DistilQwen_1M
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 dataset_info: features: - name: instruct dtype: string - name: output dtype: string splits: - name: train num_bytes: 5352504933 num_examples: 2311632 download_size: 2773269443 dataset_size: 5352504933 configs: - config_name: default data_files: - split: train path: data/train-* --- # DistilQwen-1M: High-Quality Instruction-Tuning Dataset ## Overview To empower community developers in enhancing the **instruction-following capabilities** of large language models (LLMs), we open-source **`DistilQwen-1M`**, a distilled subset of the training data used for the **DistilQwen model series**. Alongside its smaller counterpart (`DistilQwen-100K`), this dataset provides diverse, high-quality samples to improve model performance in key areas. ## Dataset Features - **Scale**: **1 million** meticulously distilled entries. - **Coverage**: Balanced mix of: - **Mathematics** - **Code generation & understanding** - **Knowledge-based QA** - **Instruction following** - **Creative generation** - **Purpose**: Optimized for **instruction tuning**, helping models retain generalization while adapting to downstream tasks. ## Use Cases - **Fine-tuning LLMs**: Mitigate *catastrophic forgetting* by combining with custom datasets. - **Multi-task learning**: Improve coherence in mathematical reasoning, coding, and creative tasks. - **Research**: Study distillation techniques or instruction-tuning efficacy. ## Reference For more detailed information about the dataset construction process, we encourage you to refer to our paper: - **DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models** Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang [arXiv:2504.15027](https://arxiv.org/abs/2504.15027) You can cite the paper using the following citation format: ```bibtex @misc{wang2025distilqwen25industrialpracticestraining, title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models}, author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang}, year={2025}, eprint={2504.15027}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2504.15027} } ```

DistilQwen-1M is a distilled subset of 1 million meticulously selected entries designed to enhance the instruction-following capabilities of large language models. It covers a balanced mix of areas including mathematics, code generation & understanding, knowledge-based QA, instruction following, and creative generation. The dataset is optimized for instruction tuning to help models maintain generalization while adapting to downstream tasks.
提供机构:
alibaba-pai
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作