codeparrot/self-instruct-starcoder

Name: codeparrot/self-instruct-starcoder
Creator: codeparrot
Published: 2023-10-23 12:13:18
License: No description provided

Source: Hugging Face. Updated 2023-10-23; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/codeparrot/self-instruct-starcoder

Resource description:

---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: most_similar
    dtype: string
  - name: avg_similarity_score
    dtype: float64
  splits:
  - name: curated
    num_bytes: 1937514
    num_examples: 771
  - name: raw
    num_bytes: 12969008
    num_examples: 5003
  - name: unique
    num_bytes: 786771
    num_examples: 308
  - name: compile
    num_bytes: 9048805
    num_examples: 3549
  download_size: 10935008
  dataset_size: 24742098
tags:
- code
size_categories:
- 1K<n<10K
task_categories:
- text2text-generation
license: bigscience-openrail-m
language:
- en
---

# Self-instruct-starcoder

## Table of Contents

- [Summary](#summary)
- [Our approach](#our-approach)
- [Dataset generation](#dataset-generation)
- [Dataset quality](#dataset-quality)
- [Post-processing](#post-processing)
  - [Self-consistency](#self-consistency)
  - [Uniqueness](#uniqueness)
  - [Compile](#compile)
- [Dataset structure](#dataset-structure)
- [Additional resources](#additional-resources)
- [Citation](#citation)

## Summary

Self-instruct-starcoder is a dataset generated by prompting StarCoder to produce new instructions from a set of human-written seed instructions. The underlying process is explained in the paper [self-instruct](https://arxiv.org/abs/2212.10560). This algorithm gave rise to well-known machine-generated datasets such as [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) and [Code Alpaca](https://github.com/sahil280114/codealpaca), both obtained by prompting OpenAI's `text-davinci-003` engine.

## Our approach

While our method is similar to Self-Instruct and Stanford Alpaca, we modified the pipeline in several ways to suit our goals.

- Rather than using `text-davinci-003`, we chose to prompt [StarCoder](https://arxiv.org/abs/2305.06161), a 10x smaller LLM developed for code use cases. However, it is possible to use any decoder-based LLM on the hub.
- We changed our seed tasks in order to have the model generate code-related tasks.
  We supplemented the seed tasks from Code Alpaca with 20 additional algorithm instructions.
- We switched from the generation format `"instruction":` - `"input":` - `"output":` to the format `"instruction":` - `"output":` by concatenating each instruction and its input under the keyword `instruction`. We did so because the previous prompting format tended to make the model generate test cases as input and their solution as output, which is not what we wanted.
- Finally, we added the option to change the trigger word in the prompt. We thus replaced the `"instruction" :` keyword by `"Here is the correct solution to the problem ":`, which resulted in much better generated instructions.

## Dataset generation

Generating the dataset was time-consuming, so we chose our parameters to limit the computational burden of our method.

- Number of examples in context: 4
  - 2 seed instructions
  - 2 machine-generated instructions
- Number of instructions to generate: 5000
- Stop words used in the generation: `["\n20", "20.", "20 ."]`
- Similarity threshold for ROUGE score: 0.7

## Dataset quality

StarCoder, while a great model, is not as capable as `text-davinci-003`. During generation, the model quickly reaches a ceiling in terms of creativity. Many instructions are similar to one another, but this should not be a problem since they are not phrased the same way.

## Post-processing

Post-processing is an important part of the pipeline: it improves the quality of the dataset, even though it means discarding some examples. First, we need to identify what we want to avoid:

- A generated solution that does not answer the corresponding instruction
- An instruction that is too similar to another one

### Self-consistency

We devised a process that we named **self-consistency**. The idea is to reverse-prompt the model to see if it can generate a sound instruction that corresponds to the solution (output) it is prompted with.
This is a particularly difficult few-shot task, and unfortunately StarCoder does not perform especially well on it. With `4` few-shot examples (all of them seed tasks), the model recovers instructions for 1135 of the 5003 outputs, which amounts to about 22.7% of the raw dataset. Fortunately, StarCoder's inability to generate instructions for some solutions does not mean we should get rid of them. For the solutions (outputs) with generated instructions, we can compare these with the ground truth. For that we use [Sentence-BERT](https://arxiv.org/abs/1908.10084), because the comparison should focus on meaning rather than word-for-word similarity. Of the recovered instructions, 771 (~68%) have a similarity score >= 0.5 with their ground truth. These can be seen as high-quality examples; they form the `curated` set.

<p align="center">
  <img src="https://huggingface.co/datasets/codeparrot/self-instruct-starcoder/resolve/main/output.png" alt="drawing" width="300" height="300"/>
</p>

### Uniqueness

Another way to clean the raw dataset is to keep only distinct instructions. For a given instruction, we go through all the instructions generated before it to see if any of them has a similarity score >= 0.5. If so, we remove that instruction. This process removes about 94% of the raw dataset; the remaining instructions form the `unique` set.

### Compile

We also built a set that contains only the examples whose code is written in Python 3 and does not cause a compilation error.

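
The compile filter described above could be approximated with Python's built-in `compile()`, which parses and byte-compiles source code without executing it. This is only a minimal sketch under that assumption, not the authors' script; the helper name `compiles_as_python3` and the sample records are hypothetical.

```python
def compiles_as_python3(source: str) -> bool:
    """Return True if `source` byte-compiles without a syntax-level error."""
    try:
        # compile() parses and byte-compiles the code but never runs it.
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        # ValueError covers sources with null bytes on older Python versions.
        return False


# Hypothetical records shaped like the dataset's rows (not real data).
records = [
    {"instruction": "Add two numbers.", "output": "def add(a, b):\n    return a + b"},
    {"instruction": "Broken example.", "output": "def add(a, b)\n    return a + b"},
]

# Keep only the rows whose output is valid Python 3 source.
compilable = [r for r in records if compiles_as_python3(r["output"])]
print(len(compilable))
```

A check like this only catches syntax-level problems; code that compiles may still fail at run time or fail to answer its instruction.
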
## Dataset structure

```python
from datasets import load_dataset

# Download (or load from the local cache) all four splits.
dataset = load_dataset("codeparrot/self-instruct-starcoder")
print(dataset)
# DatasetDict({
#     compile: Dataset({
#         features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
#         num_rows: 3549
#     })
#     curated: Dataset({
#         features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
#         num_rows: 771
#     })
#     raw: Dataset({
#         features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
#         num_rows: 5003
#     })
#     unique: Dataset({
#         features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
#         num_rows: 308
#     })
# })
```

|Field|Type|Description|
|---|---|---|
|instruction|string|Instruction|
|output|string|Answer to the instruction|
|most_similar|string|Dictionary containing the 10 most similar instructions generated before the current instruction, along with their similarity scores|
|avg_similarity_score|float64|Average similarity score|

## Additional resources

- [Space(self-instruct-starcoder)](https://huggingface.co/spaces/codeparrot/self-instruct-starcoder)
- [GitHub Repository](https://github.com/ArmelRandy/Self-instruct)

## Citation

```
@misc{title={Self-Instruct-StarCoder},
  author={Zebaze, Armel Randy},
  doi={https://doi.org/10.57967/hf/0790},
}
```

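Provider:

codeparrot

Summary of original information

Dataset overview

Dataset information

Features

instruction: string; the instruction.
output: string; the output for the instruction.
most_similar: string; contains the 10 instructions most similar to the current one, along with their similarity scores.
avg_similarity_score: float; the average similarity score.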

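Splits

curated: 771 examples, 1937514 bytes in total.
raw: 5003 examples, 12969008 bytes in total.
unique: 308 examples, 786771 bytes in total.
compile: 3549 examples, 9048805 bytes in total.

Dataset size

Download size: 10935008 bytes
Total dataset size: 24742098 bytes

Size category

1K<n<10K: the dataset has between 1,000 and 10,000 examples.

Task category

text2text-generation: the dataset is intended for text-to-text generation tasks.

License

bigscience-openrail-m: the license type of the dataset.

Language

en: the dataset is primarily in English.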

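Dataset generation

Generation parameters

Number of in-context examples: 4
- 2 seed instructions
- 2 machine-generated instructions
Number of instructions to generate: 5000
Generation stop words: ["\n20", "20.", "20 ."]
Similarity threshold (ROUGE score): 0.7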
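
To make the role of the 0.7 threshold concrete, the sketch below filters a candidate instruction against a pool of existing ones using a simplified ROUGE-L F1 score. It assumes, following the Self-Instruct paper, that a candidate is discarded when its score against any pooled instruction exceeds the threshold. The functions `lcs_length`, `rouge_l_f1` and `is_novel` are hypothetical helpers written for illustration, with whitespace tokenization standing in for a real ROUGE implementation.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on lowercase whitespace tokens (an assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Accept a candidate only if no pooled instruction scores above the threshold."""
    return all(rouge_l_f1(candidate, existing) <= threshold for existing in pool)


# Hypothetical pool with a single existing instruction.
pool = ["Write a function that reverses a string."]
print(is_novel("Write a function that reverses a list.", pool))  # near-duplicate
print(is_novel("Compute the factorial of n.", pool))             # unrelated
```

The first candidate shares six of its seven tokens, in order, with the pooled instruction, so it scores about 0.86 and is rejected. The second shares none and is accepted.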

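Dataset quality

During generation, StarCoder reaches a ceiling in creativity: many instructions are similar to one another, although they are phrased differently.

Post-processing

Self-consistency

The model is reverse-prompted to check whether it can generate an instruction that corresponds to a given solution. With 4 few-shot examples (all seed tasks), the model recovers instructions for 1135 of the 5003 examples, about 22.7%.
Sentence-BERT is used to compare each generated instruction with the original one; 771 of the recovered instructions (about 68%) have a similarity score >= 0.5 and form the curated set.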
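
The selection step can be sketched as a threshold on the similarity between the embedding of the original instruction and that of the recovered one. The example assumes the score is a cosine similarity and that embeddings have already been computed by some sentence-embedding model; the toy 3-dimensional vectors, the ids, and the helper `select_curated` are all made up for illustration.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_curated(pairs, threshold=0.5):
    """Keep the ids whose recovered instruction embeds close to the original."""
    return [
        example_id
        for example_id, original_vec, recovered_vec in pairs
        if cosine_similarity(original_vec, recovered_vec) >= threshold
    ]


# Toy embeddings standing in for real sentence embeddings (assumption).
pairs = [
    ("ex-1", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),  # nearly parallel: kept
    ("ex-2", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal: dropped
]
print(select_curated(pairs))
```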

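Uniqueness

Each instruction is compared with all the instructions generated before it and is removed if any of them has a similarity score >= 0.5. This removes about 94% of the raw dataset; the remaining instructions form the unique set.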
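
A minimal sketch of this filter follows. It compares each instruction with every earlier one, including those already removed, which is one reading of the description above. The word-overlap `jaccard` function is only a stand-in for the actual similarity model, and `unique_indices` is a hypothetical helper name.

```python
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Toy similarity: overlap of lowercase word sets (stand-in for a real model)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def unique_indices(
    instructions: list[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,
) -> list[int]:
    """Indices of instructions with no sufficiently similar predecessor."""
    kept = []
    for i, current in enumerate(instructions):
        # Compare against every earlier instruction, kept or not.
        if not any(similarity(current, earlier) >= threshold
                   for earlier in instructions[:i]):
            kept.append(i)
    return kept


# Hypothetical instructions in generation order.
instructions = [
    "Write a function to reverse a string",
    "Write a function to reverse a list",
    "Compute the factorial of an integer",
]
print(unique_indices(instructions, jaccard))
```

The pairwise loop is quadratic in the number of instructions, which is manageable for about 5,000 rows but would need batching or an index for much larger sets.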

Compile

A set containing only the examples whose Python 3 code has no compilation errors.

Dataset structure

Loading the dataset

```python
from datasets import load_dataset

dataset = load_dataset("codeparrot/self-instruct-starcoder")
```

Dataset dictionary

```
DatasetDict({
    compile: Dataset({
        features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
        num_rows: 3549
    })
    curated: Dataset({
        features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
        num_rows: 771
    })
    raw: Dataset({
        features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
        num_rows: 5003
    })
    unique: Dataset({
        features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
        num_rows: 308
    })
})
```

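Field descriptions

|Field|Type|Description|
|---|---|---|
|instruction|string|Instruction|
|output|string|Output for the instruction|
|most_similar|string|Contains the 10 instructions most similar to the current one, along with their similarity scores|
|avg_similarity_score|float|Average similarity score|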
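
Because `most_similar` is stored as a string rather than a nested structure, it has to be decoded before use. The sketch below assumes the string is either JSON or a Python-literal dictionary mapping each similar instruction to its score; the exact serialization is not stated here, so the sample value and the helper `parse_most_similar` are hypothetical. Whether `avg_similarity_score` is the plain mean of these scores is likewise an assumption.

```python
import ast
import json


def parse_most_similar(raw: str) -> dict[str, float]:
    """Decode a `most_similar` string into {instruction: score}."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a Python-literal dict such as "{'text': 0.5}".
        parsed = ast.literal_eval(raw)
    return {str(k): float(v) for k, v in parsed.items()}


# Hypothetical field value; real rows may be formatted differently.
raw = '{"Write a function to reverse a string.": 0.62, "Sort a list of integers.": 0.41}'
scores = parse_most_similar(raw)

# Assumed relationship: the average score is the mean of the listed scores.
average = sum(scores.values()) / len(scores)
print(round(average, 3))
```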
