five

COIG-Kun

收藏
魔搭社区2025-11-19 更新2024-05-15 收录
下载链接:
https://modelscope.cn/datasets/m-a-p/COIG-Kun
下载链接
链接失效反馈
官方服务:
资源简介:
<div align="center"> <img src="Yi_logo.svg" width="150px" style="display: inline-block;"> <img src="m-a-p.png" width="150px" style="display: inline-block;"> </div> # Kun: Answer Polishment Saves Your Time for Using Intruction Backtranslation on Self-Alignment ## Table of Contents - [Overview](#overview) - [Dataset Description](#dataset-description) - [Usage](#usage) - [Citation](#citation) - [Acknowledgments](#acknowledgments) ## Overview The COIG-Kun dataset, part of the [COIG-Kun GitHub](https://github.com/Zheng0428/COIG-Kun) project, consists of instructional data used for training language models. This dataset was developed following the methodology inspired by Meta's "Self-Alignment with Instruction Backtranslation" and adapted for optimal performance in training label, point, and answer models. ## Dataset Description ### Language - The dataset contains instructions primarily in Chinese. ### Dataset Structure - **Data Instances**: Each data instance is structured in a JSON format with two fields: `instruction` and `output`. - Example: `{"instruction": "如何评价祁又一自编自导的电影《鸽子小姐》?", "output": "《鸽子小姐》是一部由祁又一自编自导的电影。..."}` - **Data Split**: The dataset is comprised of three subsets: - `wudao.jsonl`: 139,852 instances - `wanjuan.jsonl`: 328,294 instances - `skypile.jsonl`: 71,560 instances ### Data Characteristics - The dataset is designed to provide high-quality instructional data for language model training, focusing on enhancing the quality and applicability of the data. ## Methodology Our approach closely follows the self-alignment method ådescribed by Meta, with adaptations to optimize the process: 1. **Seed Data Selection and Model Training**: Initially, appropriate seed data are selected and inverted to train a Label Model on a base model(Yi Base). Concurrently, using the same seed data, a Primary Chat model is trained following the Supervised Fine-Tuning (SFT) method typical of chat models. 3. **Labeling Unlabeled Data**: The Label Model is then used to annotate preliminarily cleansed Primary data. Cleansing involves filtering based on perplexity (ppl) and length, discarding data exceeding 512 tokens. 4. **Instruction Data Generation**: Post-annotation, we obtain our first version of Labeled data. Unlike the original project where both instruction and output data pairs are fed into Primary Chat Model for scoring, our replication revealed limitations in Primary Chat's ability to discern high-quality instructions. We innovated by scoring only the instruction component, effectively filtering out noise and selecting high-quality instructions. 5. **Output Data Refinement**: Upon manual inspection, we identified a mismatch between the Primary Data (used as output) and the standard requirements for output in instruction data. To address this, we introduced an additional step: refining the output data. Using Primary Chat's capabilities, the output (originally unlabeled data) is adjusted according to the instructions, making it more suitable as output for the instruction data. 6. **Framework Completion**: Our methodology concludes with the acquisition of a substantial volume of instructional data, achieved with minimal resource expenditure. ![Project Framework](Kun_white.png) ## Usage ### Using the Data - The dataset can be used for training and fine-tuning language models, specifically focusing on instruction understanding and response generation. - Users are encouraged to refer to the project documentation for detailed instructions on utilizing the dataset in the training process. ## Citation If you use this dataset in your research, please cite it as follows: ```bibtex @misc{COIG-Kun, title={Kun: Answer Polishment Saves Your Time for Using Intruction Backtranslation on Self-Alignment}, author={Tianyu, Zheng* and Shuyue, Guo* and Xingwei, Qu and Xinrun, Du and Wenhu, Chen and Jie, Fu and Wenhao, Huang and Ge, Zhang}, year={2023}, publisher={GitHub}, journal={GitHub repository}, howpublished={https://github.com/Zheng0428/COIG-Kun} } ``` ## Acknowledgments This dataset was created by a dedicated team at [M-A-P]. We acknowledge the contributions of all individuals and organizations that made this project possible.

<div align="center"> <img src="Yi_logo.svg" width="150px" style="display: inline-block;"> <img src="m-a-p.png" width="150px" style="display: inline-block;"> </div> # Kun:答案打磨,为你省去自对齐任务中指令回译的耗时 ## 目录 - [概述](#overview) - [数据集详情](#dataset-description) - [使用方法](#usage) - [引用方式](#citation) - [致谢](#acknowledgments) ## 概述 COIG-Kun数据集是[COIG-Kun GitHub仓库](https://github.com/Zheng0428/COIG-Kun)项目的组成部分,包含用于训练大语言模型(Large Language Model)的指令数据。本数据集的研发借鉴了Meta提出的《基于指令回译的自对齐(Self-Alignment with Instruction Backtranslation)》方法,并针对标签模型(Label Model)、评分模型与答案模型的训练进行了适配优化,以实现最佳性能。 ## 数据集详情 ### 语言 - 数据集的指令文本以中文为主。 ### 数据集结构 - **数据实例**:每条数据实例采用JSON格式构建,包含`instruction`(指令)与`output`(输出)两个字段。 - 示例:`{"instruction": "如何评价祁又一自编自导的电影《鸽子小姐》?", "output": "《鸽子小姐》是一部由祁又一自编自导的电影。..."}` - **数据划分**:数据集包含三个子集: - `wudao.jsonl`:139,852条数据实例 - `wanjuan.jsonl`:328,294条数据实例 - `skypile.jsonl`:71,560条数据实例 ### 数据特性 - 本数据集旨在为大语言模型训练提供高质量的指令数据,重点提升数据的质量与适配性。 ## 研发流程 我们的方法严格遵循Meta提出的自对齐流程,并针对优化目标进行了适配调整: 1. **种子数据选择与模型训练**:首先筛选合适的种子数据,并对其进行回译处理,基于Yi Base基座模型训练得到标签模型(Label Model)。同时,使用相同的种子数据,按照聊天模型通用的有监督微调(Supervised Fine-Tuning,SFT)方法训练得到基础对话模型(Primary Chat)。 3. **未标注数据标注**:使用上述标签模型对初步清洗后的基础对话数据进行标注。数据清洗环节基于困惑度(perplexity,ppl)与文本长度进行筛选,丢弃超过512个Token的数据。 4. **指令数据生成**:标注完成后,我们得到了首批标注数据。与原项目中将指令与输出数据对一同输入基础对话模型进行评分的方式不同,我们的复现实验发现基础对话模型难以准确判别高质量指令。因此我们进行了创新,仅对指令部分进行评分,有效过滤了噪声数据并筛选出高质量指令。 5. **输出数据打磨**:通过人工检视,我们发现原本作为输出的基础对话数据与指令数据的输出标准存在不匹配。为解决这一问题,我们新增了输出数据打磨环节:借助基础对话模型的能力,将原本为未标注数据的输出内容按照对应指令进行调整,使其更适配作为指令数据的输出内容。 6. **流程闭环**:通过上述流程,我们最终获得了大量高质量指令数据,且整个过程的资源消耗极低。 ![项目框架图](Kun_white.png) ## 使用方法 ### 数据使用 - 本数据集可用于大语言模型的训练与微调,尤其适用于指令理解与回复生成任务。 - 建议用户参考项目文档,以获取数据集在训练流程中的详细使用指南。 ## 引用方式 若您在研究中使用本数据集,请按照以下格式引用: bibtex @misc{COIG-Kun, title={Kun: Answer Polishment Saves Your Time for Using Intruction Backtranslation on Self-Alignment}, author={Tianyu, Zheng* and Shuyue, Guo* and Xingwei, Qu and Xinrun, Du and Wenhu, Chen and Jie, Fu and Wenhao, Huang and Ge, Zhang}, year={2023}, publisher={GitHub}, journal={GitHub repository}, howpublished={https://github.com/Zheng0428/COIG-Kun} } ## 致谢 本数据集由[M-A-P]团队的核心成员倾力打造。我们向所有为该项目提供支持与贡献的个人与组织致以诚挚谢意。
提供机构:
maas
创建时间:
2024-04-13
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作