likaixin/InstructCoder

Name: likaixin/InstructCoder
Creator: likaixin
Published: 2024-02-28 12:13:18
License: No description available

Hugging Face. Updated 2024-02-28; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/likaixin/InstructCoder

Resource overview:

InstructCoder is the first dataset designed to adapt Large Language Models (LLMs) to general code editing. It contains 114,239 instruction-input-output triplets and covers 20 distinct code editing scenarios generated by ChatGPT. A LLaMA-33B model fine-tuned on InstructCoder performs comparably to ChatGPT on a real-world test set. The dataset was collected through a systematic iterative process: initial tasks were sourced from GitHub commits, and ChatGPT was then prompted with the Self-Instruct method to generate new instructions. High-quality samples were manually screened and recursively added back to the task pool to seed further generation.
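The iterative collection process described above can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: `bootstrap_round`, `fake_generate`, and `looks_usable` are hypothetical names, the generator is a stub standing in for a ChatGPT call, and the word-count filter stands in for manual screening.

```python
import random
from typing import Callable, List, Optional


def bootstrap_round(
    task_pool: List[str],
    generate: Callable[[List[str]], List[str]],
    keep: Callable[[str], bool],
    num_demos: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Run one Self-Instruct-style round and return the enlarged task pool."""
    rng = rng or random.Random(0)
    # Sample a few existing tasks to use as in-context demonstrations.
    demos = rng.sample(task_pool, k=min(num_demos, len(task_pool)))
    # Ask the generator (an LLM in practice, a stub here) for new instructions.
    candidates = generate(demos)
    seen = set(task_pool)
    accepted = []
    for candidate in candidates:
        candidate = candidate.strip()
        # Drop empty strings, duplicates, and anything the screen rejects.
        if candidate and candidate not in seen and keep(candidate):
            accepted.append(candidate)
            seen.add(candidate)
    # Accepted samples join the pool and can seed later rounds.
    return task_pool + accepted


def fake_generate(demos: List[str]) -> List[str]:
    """Hypothetical stand-in for a ChatGPT call: one variant per demo plus a duplicate."""
    return [f"Variant of: {d}" for d in demos] + [demos[0]]


def looks_usable(text: str) -> bool:
    """Hypothetical stand-in for manual screening: require at least three words."""
    return len(text.split()) >= 3


# Illustrative seed tasks, not taken from the dataset.
seed_pool = [
    "Add type hints to the function",
    "Rename variable x to count",
    "Fix the off-by-one error in the loop",
]
pool_after_one = bootstrap_round(seed_pool, fake_generate, looks_usable)
```

Calling `bootstrap_round` repeatedly on its own output mirrors the recursive growth of the task pool; in a real setup the stub generator and the filter would be replaced by a model call and human review.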

Provided by:

likaixin

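Summary of original information

Dataset overview

Name: InstructCoder

Purpose: Adapting large language models (LLMs) to general code editing.

Composition: 114,239 instruction-input-output triplets.

Coverage: Multiple distinct code editing scenarios.

Generation method: Generated by ChatGPT.

Performance: A LLaMA-33B model fine-tuned on InstructCoder performs comparably to ChatGPT on a real-world test set extracted from GitHub commits.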
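To make the instruction-input-output triplet structure concrete, here is a small sketch of how one record might be represented and turned into a fine-tuning prompt. The field names follow the triplet description above and are assumptions rather than a confirmed schema; the `EditExample` class, the `build_prompt` helper, the prompt template, and the sample record are all hypothetical and not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditExample:
    """One code-editing triplet (assumed field names)."""
    instruction: str  # natural-language description of the edit
    input: str        # code before the edit
    output: str       # code after the edit (the training target)


def build_prompt(example: EditExample) -> str:
    """Format the instruction and input; the model is trained to continue with the output."""
    return (
        f"### Instruction:\n{example.instruction}\n\n"
        f"### Input:\n{example.input}\n\n"
        "### Output:\n"
    )


# Made-up record for illustration only.
example = EditExample(
    instruction="Rename the variable `x` to `count`.",
    input="x = 0\nx += 1\n",
    output="count = 0\ncount += 1\n",
)
prompt = build_prompt(example)
training_text = prompt + example.output
```

Keeping the target out of `prompt` and appending it only in `training_text` makes it easy to compute loss on the edited code alone, if that is the desired training setup.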
