DistilQwen_1M

Source: ModelScope community. Updated 2026-01-06; indexed 2025-05-31.
Download link:
https://modelscope.cn/datasets/PAI/DistilQwen_1M
Description:
# DistilQwen-1M: High-Quality Instruction-Tuning Dataset

## Overview

To empower community developers in enhancing the **instruction-following capabilities** of large language models (LLMs), we open-source **`DistilQwen-1M`**, a distilled subset of the training data used for the **DistilQwen model series**. Alongside its smaller counterpart (`DistilQwen-100K`), this dataset provides diverse, high-quality samples to improve model performance in key areas.

## Dataset Features

- **Scale**: **1 million** meticulously distilled entries.
- **Coverage**: Balanced mix of:
  - **Mathematics**
  - **Code generation & understanding**
  - **Knowledge-based QA**
  - **Instruction following**
  - **Creative generation**
- **Purpose**: Optimized for **instruction tuning**, helping models retain generalization while adapting to downstream tasks.

## Use Cases

- **Fine-tuning LLMs**: Mitigate *catastrophic forgetting* by combining with custom datasets.
- **Multi-task learning**: Improve coherence in mathematical reasoning, coding, and creative tasks.
- **Research**: Study distillation techniques or instruction-tuning efficacy.

## Reference

For more detailed information about the dataset construction process, we encourage you to refer to our paper:

- **DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models**
  Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
  [arXiv:2504.15027](https://arxiv.org/abs/2504.15027)

You can cite the paper using the following citation format:

```bibtex
@misc{wang2025distilqwen25industrialpracticestraining,
  title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models},
  author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang},
  year={2025},
  eprint={2504.15027},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.15027}
}
```
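
The "Fine-tuning LLMs" use case mentions combining this dataset with custom data to mitigate catastrophic forgetting. The dataset card does not say how to do the mixing, so the following is only a minimal sketch of one common approach: keep every custom record and add a random sample of general instruction records until they make up a chosen fraction of the mix. The function name `mix_datasets`, the `general_ratio` parameter, and the `instruction`/`output` field names are all hypothetical and do not come from the dataset card; the actual schema should be checked after downloading.

```python
import random
from typing import Dict, List

# Assumed record shape; the real DistilQwen-1M field names may differ.
Record = Dict[str, str]


def mix_datasets(
    custom: List[Record],
    general: List[Record],
    general_ratio: float = 0.5,
    seed: int = 0,
) -> List[Record]:
    """Return all custom records plus a random sample of general records.

    general_ratio is the target fraction of general records in the result.
    If there are too few general records, all of them are used.
    """
    if not 0.0 <= general_ratio < 1.0:
        raise ValueError("general_ratio must be in [0, 1)")

    rng = random.Random(seed)
    # Solve n_general / (n_general + n_custom) = general_ratio for n_general.
    n_general = round(len(custom) * general_ratio / (1.0 - general_ratio))
    n_general = min(n_general, len(general))

    mixed = list(custom) + rng.sample(general, n_general)
    rng.shuffle(mixed)  # interleave both sources for training
    return mixed


if __name__ == "__main__":
    # Toy stand-ins for a custom dataset and a slice of general data.
    custom = [{"instruction": f"custom task {i}", "output": "..."} for i in range(100)]
    general = [{"instruction": f"general task {i}", "output": "..."} for i in range(1000)]
    mixed = mix_datasets(custom, general, general_ratio=0.5)
    print(len(mixed), "records in the mixed training set")
```

The right `general_ratio` depends on the task and model; it is best treated as a hyperparameter to tune rather than a recommended value.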

Provider:
maas
Created:
2025-05-27