Kun

Name: Kun
Creator: Open-source community
License: No description available

Source: arXiv (indexed 2025-09-30)

Download link:

https://huggingface.co/datasets/m-a-p/COIG-Kun

Resource overview:

Kun is a high-quality instruction-tuning dataset for large language models (LLMs), built with a self-training algorithm based on instruction back-translation and answer refinement. It contains over one million Chinese instruction data points generated from unlabeled data drawn from WuDao, Wanjuan, and Skypile. The instructions span a range of academic disciplines, industry domains, and text types.
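The overview names two steps, instruction back-translation and answer refinement, without detailing them. The following is only a minimal sketch of how such a loop might be organized, not the dataset authors' implementation. Every name in it (InstructionPair, build_pairs, propose_instruction, refine_answer, keep, and the toy_* stand-ins) is hypothetical, and the toy functions merely take the place of real language-model calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class InstructionPair:
    instruction: str
    response: str


def build_pairs(
    texts: Iterable[str],
    propose_instruction: Callable[[str], str],
    refine_answer: Callable[[str, str], str],
    keep: Callable[[str, str], bool],
) -> List[InstructionPair]:
    """Turn unlabeled texts into instruction/response pairs.

    All three callables are hypothetical stand-ins for model calls.
    """
    pairs = []
    for text in texts:
        # Back-translation: guess an instruction the text could answer.
        instruction = propose_instruction(text)
        # Filtering: drop candidates that fail a quality check.
        if not keep(instruction, text):
            continue
        # Refinement: rewrite the raw text into a cleaner answer.
        response = refine_answer(instruction, text)
        pairs.append(InstructionPair(instruction, response))
    return pairs


# Toy stand-ins so the sketch runs without any model (assumptions only).
def toy_propose(text: str) -> str:
    return "Explain: " + text.split(".")[0]


def toy_refine(instruction: str, text: str) -> str:
    return text.strip()


def toy_keep(instruction: str, text: str) -> bool:
    return len(text) >= 20


texts = [
    "Instruction tuning adapts a model to follow prompts. It uses pairs.",
    "Too short.",
]
pairs = build_pairs(texts, toy_propose, toy_refine, toy_keep)
```

In this sketch the second text is dropped by the length filter, so only one pair is produced. A real pipeline would replace the toy functions with model inference and a stronger quality filter.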

Provided by:

Open-source community
