
Kun

Indexed from arXiv on 2025-09-30
Download link:
https://huggingface.co/datasets/m-a-p/COIG-Kun
Resource overview:
COIG-Kun is a high-quality instruction-tuning dataset for large language models (LLMs), built with a self-training algorithm based on instruction back-translation and answer refinement. It draws on unlabeled data from WuDao, Wanjuan, and Skypile to generate more than one million Chinese instruction data points, with instructions spanning academic disciplines, industrial domains, and text types.
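
The general idea of instruction back-translation followed by answer refinement can be sketched roughly as below. This is a minimal illustration, not the dataset authors' implementation: the names `InstructionPair`, `back_translate`, `generate_instruction`, `refine_answer`, and `keep` are all hypothetical, and the two model calls are passed in as plain callables so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class InstructionPair:
    """One instruction-tuning example (hypothetical structure)."""
    instruction: str
    response: str


def back_translate(
    passages: Iterable[str],
    generate_instruction: Callable[[str], str],
    refine_answer: Callable[[str, str], str],
    keep: Callable[[str, str], bool] = lambda instruction, passage: bool(instruction.strip()),
) -> List[InstructionPair]:
    """Turn unlabeled passages into (instruction, response) pairs.

    generate_instruction: assumed model call that proposes an instruction
        the passage could answer (the "back-translation" step).
    refine_answer: assumed model call that rewrites the passage into a
        cleaner response to that instruction (the "refinement" step).
    keep: assumed filter that decides whether a candidate is retained.
    """
    pairs: List[InstructionPair] = []
    for passage in passages:
        instruction = generate_instruction(passage)
        if not keep(instruction, passage):
            continue  # drop candidates the filter rejects
        response = refine_answer(instruction, passage)
        pairs.append(InstructionPair(instruction, response))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins for the two models, for demonstration only.
    def toy_instruction(passage: str) -> str:
        return "" if len(passage) < 10 else "Summarize the following topic: " + passage[:20]

    def toy_refine(instruction: str, passage: str) -> str:
        return passage.strip()

    for pair in back_translate(["Large language models need data.", "hi"], toy_instruction, toy_refine):
        print(pair)
```

In practice the two callables would wrap trained models and the filter would score candidate quality; those details are not given in this listing.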
Provider:
Open-source community