codeparrot/self-instruct-starcoder
Source: Hugging Face. Updated: 2023-10-23. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/codeparrot/self-instruct-starcoder
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: most_similar
    dtype: string
  - name: avg_similarity_score
    dtype: float64
  splits:
  - name: curated
    num_bytes: 1937514
    num_examples: 771
  - name: raw
    num_bytes: 12969008
    num_examples: 5003
  - name: unique
    num_bytes: 786771
    num_examples: 308
  - name: compile
    num_bytes: 9048805
    num_examples: 3549
  download_size: 10935008
  dataset_size: 24742098
tags:
- code
size_categories:
- 1K<n<10K
task_categories:
- text2text-generation
license: bigscience-openrail-m
language:
- en
---
# Self-instruct-starcoder
## Table of Contents
- [Summary](#summary)
- [Our approach](#our-approach)
- [Dataset generation](#dataset-generation)
- [Dataset quality](#dataset-quality)
- [Post-processing](#post-processing)
    - [Self-consistency](#self-consistency)
    - [Uniqueness](#uniqueness)
    - [Compile](#compile)
- [Dataset structure](#dataset-structure)
- [Additional resources](#additional-resources)
## Summary
Self-instruct-starcoder is a dataset generated by prompting StarCoder to produce new instructions from a set of human-written seed instructions.
The underlying process is described in the [Self-Instruct](https://arxiv.org/abs/2212.10560) paper. This algorithm gave rise to well-known machine-generated
datasets such as [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) and [Code Alpaca](https://github.com/sahil280114/codealpaca), both of which were
obtained by prompting OpenAI's `text-davinci-003` engine.
## Our approach
While our method is similar to Self-Instruct and Stanford Alpaca, we made several modifications to the pipeline to suit our goals.
- Rather than using `text-davinci-003`, we chose to prompt [StarCoder](https://arxiv.org/abs/2305.06161), a 10x smaller LLM developed for code use cases. However, any decoder-based LLM on the Hub can be used.
- We changed the seed tasks so that the model generates code-related tasks. We supplemented the seed tasks from Code Alpaca with 20 additional algorithm instructions.
- We switched from the generation format `"instruction":` - `"input":` - `"output":` to the format `"instruction":` - `"output":` by concatenating each instruction and its input under the
keyword `instruction`. We did so because the previous prompting format tended to make the model generate test cases as input and their solutions as output, which is not what we wanted.
- Finally, we made the trigger word in the prompt configurable. We thus replaced the `"instruction" :` keyword with `"Here is the correct solution to the problem ":`, which
resulted in much better generated instructions.
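The format change described above can be illustrated with a small helper. This is only a sketch: the function name `merge_instruction_and_input` and the single-newline separator are assumptions made for illustration, not details taken from the original pipeline.

```python
def merge_instruction_and_input(example: dict) -> dict:
    """Fold the optional `input` field into `instruction`."""
    instruction = example["instruction"].strip()
    extra = (example.get("input") or "").strip()
    # Assumption: a single newline separates the instruction from its input.
    merged = f"{instruction}\n{extra}" if extra else instruction
    return {"instruction": merged, "output": example["output"]}


# A made-up seed task in the three-field format.
seed = {
    "instruction": "Write a function that returns the sum of a list.",
    "input": "[1, 2, 3]",
    "output": "def total(xs):\n    return sum(xs)",
}
merged = merge_instruction_and_input(seed)
print(merged["instruction"])
```

After merging, each example carries only the `instruction` and `output` keys, which matches the two-field format used in the rest of this card.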
## Dataset generation
Generating the dataset was time-consuming, so we chose our parameters to limit the computational burden of the method.
- Number of examples in context: 4
    - 2 seed instructions
    - 2 machine-generated instructions
- Number of instructions to generate: 5000
- Stop words used in the generation: ["\n20", "20.", "20 ."]
- Similarity threshold for ROUGE score: 0.7
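To make the role of the ROUGE threshold concrete, here is a minimal sketch of a novelty filter. It assumes the score is ROUGE-L F1 computed on lowercased whitespace tokens and that a candidate is rejected when its score against any existing instruction exceeds 0.7; the function names, the tokenization, and the exact boundary handling are illustrative assumptions rather than the pipeline's actual code.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (a simplification)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """True if no instruction in the pool scores above the threshold."""
    return all(rouge_l_f1(candidate, other) <= threshold for other in pool)


pool = ["Write a function that reverses a string."]
print(is_novel("Write a function that reverses a list.", pool))      # False
print(is_novel("Implement binary search on a sorted array.", pool))  # True
```

In the first call, six of seven tokens are shared in order, giving a score of about 0.86, so the candidate is rejected. In the second, only one token is shared, so the candidate is kept.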
## Dataset quality
StarCoder, while a great model, is not as capable as `text-davinci-003`. During generation, the model quickly reaches a ceiling in terms of creativity.
Many instructions are similar to each other, but this should not be a problem since they are not phrased the same way.
## Post-processing
Post-processing is an important part of the pipeline: it improves the quality of the dataset, at the cost of discarding some examples. First, we
need to identify what we want to avoid:
- A generated solution that does not answer the corresponding instruction.
- An instruction that is too similar to another one.
### Self-consistency
We devised a process that we named **self-consistency**. The idea is to reverse-prompt the model to see whether it can generate a sound instruction that corresponds to the
solution (output) it is prompted with. This is a particularly difficult few-shot task, and unfortunately StarCoder does not perform very well on it. With `4` few-shot examples
(all being seed tasks), the model is able to recover 1135 instructions out of 5003, which amounts to about 22.7% of the raw dataset. Fortunately, the inability of StarCoder to generate instructions for some
solutions does not mean we should get rid of them. For the solutions (outputs) with generated instructions, we can compare these with the ground truth. For that we can use [Sentence-BERT](https://arxiv.org/abs/1908.10084) because the comparison should focus on meaning
rather than on word-for-word similarity. We have 771 instructions (~68% of the 1135 recovered) with a similarity score >= 0.5 with their ground truth. These can be seen as high-quality examples; they form the `curated` set.
<p align="center">
<img src="https://huggingface.co/datasets/codeparrot/self-instruct-starcoder/resolve/main/output.png" alt="drawing" width="300" height="300"/>
</p>
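The selection step of self-consistency can be sketched as follows. The sketch assumes that the similarity score is the cosine similarity between two sentence embeddings. The names `select_curated` and `toy_embed` are hypothetical, and `toy_embed` is a deliberately crude stand-in for a Sentence-BERT encoder so that the example runs without any model download.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def select_curated(pairs, embed, threshold=0.5):
    """Keep originals whose recovered instruction is similar enough."""
    curated = []
    for original, recovered in pairs:
        if recovered is None:  # the model produced no instruction
            continue
        if cosine_similarity(embed(original), embed(recovered)) >= threshold:
            curated.append(original)
    return curated


# Toy stand-in for a sentence encoder: counts of a few hand-picked words.
VOCAB = ["sort", "list", "string", "reverse"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


# (original instruction, instruction recovered by reverse-prompting)
pairs = [
    ("sort a list", "sort the given list"),
    ("reverse a string", "sort a list"),
    ("reverse a list", None),
]
print(select_curated(pairs, toy_embed))  # ['sort a list']
```

With a real encoder, `embed` would return dense sentence vectors instead of word counts; the thresholding logic stays the same.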
### Uniqueness
Another way to clean the raw dataset is to keep only distinct instructions. For a given instruction, we go through all the instructions generated before it to see whether any of them has a similarity score >= 0.5.
If so, we remove that instruction. This process removes about 94% of the raw dataset; the remaining instructions form the `unique` set.
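A minimal sketch of this filter is shown below. It follows the wording above literally and compares each instruction against every earlier instruction, including ones that were themselves removed. The similarity measure is pluggable; the default `char_similarity`, based on `difflib.SequenceMatcher`, is only a stand-in chosen so that the example is self-contained, and is not claimed to be the measure used to build the `unique` split.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def keep_unique(instructions, similarity=char_similarity, threshold=0.5):
    """Drop every instruction that is too similar to an earlier one."""
    kept = []
    for i, current in enumerate(instructions):
        earlier = instructions[:i]
        if any(similarity(current, prev) >= threshold for prev in earlier):
            continue  # a near-duplicate was generated before this one
        kept.append(current)
    return kept


generated = [
    "Write a function that reverses a string.",
    "Write a function that reverses a string in place.",
    "Compute the factorial of n.",
]
print(keep_unique(generated))
```

Because every instruction is compared with all of its predecessors, the cost grows quadratically with the number of instructions, which is manageable for the 5003 raw examples.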
### Compile
We also built a set that contains only the examples whose code is written in Python 3 and does not cause a compilation error.
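One way to perform such a check is with Python's built-in `compile()`, which parses and byte-compiles source code without executing it. The sketch below is an assumption about how the filter could be written, not the code used to produce the `compile` split; the helper name `compiles` and the two example records are made up.

```python
def compiles(source: str) -> bool:
    """True if `source` is syntactically valid for the running Python 3."""
    try:
        # compile() parses and byte-compiles the code; nothing is executed.
        compile(source, "<generated>", "exec")
    except (SyntaxError, ValueError):
        # IndentationError is a subclass of SyntaxError; ValueError covers
        # sources containing null bytes on some Python versions.
        return False
    return True


examples = [
    {"instruction": "Add two numbers.", "output": "def add(a, b):\n    return a + b"},
    {"instruction": "Broken example.", "output": "def add(a, b)\n    return a + b"},
]
compiled = [ex for ex in examples if compiles(ex["output"])]
print(len(compiled))  # 1
```

Note that this only detects syntax-level problems: code that compiles can still fail at runtime or give a wrong answer.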
## Dataset structure
```python
from datasets import load_dataset

# Download (or load from the local cache) all four splits.
dataset = load_dataset("codeparrot/self-instruct-starcoder")
print(dataset)
# DatasetDict({
#     compile: Dataset({
#         features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
#         num_rows: 3549
#     })
#     curated: Dataset({
#         features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
#         num_rows: 771
#     })
#     raw: Dataset({
#         features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
#         num_rows: 5003
#     })
#     unique: Dataset({
#         features: ['instruction', 'output', 'most_similar', 'avg_similarity_score'],
#         num_rows: 308
#     })
# })
```
|Field|Type|Description|
|---|---|---|
|instruction|string|Instruction|
|output|string|Answer to the instruction|
|most_similar|string|Dictionary containing the 10 most similar instructions generated before the current instruction, along with their similarity scores|
|avg_similarity_score|float64|Average similarity score|
## Additional resources
- [Space(self-instruct-starcoder)](https://huggingface.co/spaces/codeparrot/self-instruct-starcoder)
- [GitHub repository](https://github.com/ArmelRandy/Self-instruct)
## Citation
```
@misc{self-instruct-starcoder,
  title={Self-Instruct-StarCoder},
  author={Zebaze, Armel Randy},
  doi={10.57967/hf/0790}
}
```
Provider:
codeparrot
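Summary of original information
Dataset overview
Dataset information
Features
- instruction: string; the instruction.
- output: string; the output for the instruction.
- most_similar: string; the 10 instructions most similar to the current one, along with their similarity scores.
- avg_similarity_score: float; the average similarity score.
Splits
- curated: 771 examples, 1937514 bytes in total.
- raw: 5003 examples, 12969008 bytes in total.
- unique: 308 examples, 786771 bytes in total.
- compile: 3549 examples, 9048805 bytes in total.
Dataset size
- Download size: 10935008 bytes
- Total dataset size: 24742098 bytes
Tags
- code: the dataset is code-related.
Size category
- 1K<n<10K: the dataset contains between 1,000 and 10,000 examples.
Task category
- text2text-generation: the dataset is intended for text-to-text generation tasks.
License
- bigscience-openrail-m: the license under which the dataset is released.
Language
- en: the dataset is primarily in English.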
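Dataset generation
Generation parameters
- Number of in-context examples: 4
    - 2 seed instructions
    - 2 machine-generated instructions
- Number of instructions to generate: 5000
- Stop words used in the generation: ["\n20", "20.", "20 ."]
- Similarity threshold: 0.7
Dataset quality
- StarCoder reaches a creativity ceiling during generation; many instructions are similar to each other, but they are phrased differently.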
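Post-processing
Self-consistency
- The model is reverse-prompted to check whether it can generate an instruction that corresponds to a given solution. With 4 few-shot examples (all seed tasks), it recovers 1135 of the 5003 examples, about 22.7%.
- Sentence-BERT is used to compare the generated instructions with the original ones; 771 instructions (about 68%) have a similarity score >= 0.5 and form the `curated` set.
Uniqueness
- Each instruction is compared with the instructions generated before it and is removed if any of them has a similarity score >= 0.5. This removes about 94% of the raw dataset; the remainder forms the `unique` set.
Compile
- A set that contains only the examples whose Python 3 code has no compilation errors.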
Dataset structure
Loading the dataset
Run `from datasets import load_dataset`, then `dataset = load_dataset("codeparrot/self-instruct-starcoder")`.
Dataset dictionary
The resulting `DatasetDict` has four splits, each with the features `instruction`, `output`, `most_similar`, and `avg_similarity_score`: `compile` (3549 rows), `curated` (771 rows), `raw` (5003 rows), and `unique` (308 rows).
Field descriptions
| Field | Type | Description |
|---|---|---|
| instruction | string | Instruction |
| output | string | Output for the instruction |
| most_similar | string | The 10 instructions most similar to the current one, along with their similarity scores |
| avg_similarity_score | float | Average similarity score |



