HKBU-NLP/Code-Evol-Instruct-OSS
Source: Hugging Face. Updated 2023-11-09; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/HKBU-NLP/Code-Evol-Instruct-OSS
Resource description:
---
license: bigcode-openrail-m
language:
- en
size_categories:
- 1K<n<10K
---
# Code-Evol-Instruct-OSS
## Summary
Code-Evol-Instruct-OSS is a dataset generated with Code Evol-Instruct by prompting two open-source LLMs, WizardLM-13B-v1.2 and WizardCoder-34B-Python.
The underlying process is explained in the paper [code-evol-instruct](https://arxiv.org/abs/2306.08568). This algorithm underlies the well-known open-source WizardCoder family of code LLMs.
## Our approach
- We did not use any closed-source LLMs.
- Our seed dataset is sourced from [self-instruct-starcoder](https://huggingface.co/datasets/codeparrot/self-instruct-starcoder).
- We use WizardLM-13B-v1.2 to evolve the instructions over three rounds.
- The responses to each instruction are generated using WizardCoder-34B-Python.
- Samples that are excessively long or lack code responses are filtered out.
- The final dataset contains 4308 samples.
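
The steps above can be sketched roughly as follows. This is an illustrative outline, not the authors' pipeline: `evolve_fn` and `respond_fn` are hypothetical stand-ins for calls to WizardLM-13B-v1.2 and WizardCoder-34B-Python, the `MAX_RESPONSE_CHARS` threshold and the "contains a fenced code block" check are assumed filter criteria, and the `instruction`/`output` field names are placeholders. The card also does not say whether every round's instruction is kept or only the last one; the sketch keeps all three.

```python
import re
from typing import Callable, Dict, Iterable, List

# Hypothetical limit; the dataset card does not state the actual threshold.
MAX_RESPONSE_CHARS = 4000

# A Markdown code fence, built without writing three backticks literally.
FENCE = "`" * 3
# Assumed proxy for "the response contains code": at least one fenced block.
CODE_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)


def evolve(seed: str, evolve_fn: Callable[[str], str], rounds: int = 3) -> List[str]:
    """Apply evolve_fn repeatedly and return the instruction from each round."""
    evolved = []
    current = seed
    for _ in range(rounds):
        current = evolve_fn(current)
        evolved.append(current)
    return evolved


def keep_sample(response: str, max_chars: int = MAX_RESPONSE_CHARS) -> bool:
    """Keep responses that are short enough and contain at least one code block."""
    return len(response) <= max_chars and CODE_BLOCK.search(response) is not None


def build_dataset(
    seeds: Iterable[str],
    evolve_fn: Callable[[str], str],
    respond_fn: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Evolve each seed, generate a response per instruction, then filter."""
    samples = []
    for seed in seeds:
        for instruction in evolve(seed, evolve_fn):
            response = respond_fn(instruction)
            if keep_sample(response):
                samples.append({"instruction": instruction, "output": response})
    return samples
```

In practice `evolve_fn` would wrap an evolution prompt sent to the instruction model and `respond_fn` would wrap generation with the code model; passing them in as callables keeps the loop and the filter testable without loading either model.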
## Preliminary Experiments
We fine-tuned starcoderbase-3b on this dataset, achieving 28.7 pass@1 on HumanEval (greedy decoding), approximately 8 points above the original model.
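
For readers unfamiliar with the metric: with greedy decoding there is one completion per problem, so pass@1 reduces to the percentage of problems whose completion passes all unit tests. A minimal sketch, assuming the per-problem pass/fail results are already available as booleans (the function name is illustrative, not part of any evaluation harness):

```python
from typing import Sequence


def pass_at_1_greedy(passed: Sequence[bool]) -> float:
    """Percentage of problems whose single greedy completion passed all tests."""
    if not passed:
        raise ValueError("need at least one problem")
    return 100.0 * sum(passed) / len(passed)


# HumanEval has 164 problems. As plain arithmetic (not a count reported in the
# dataset card), 47 passing problems would round to the stated score.
example = [True] * 47 + [False] * 117
print(round(pass_at_1_greedy(example), 1))  # 28.7
```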
## Citation
If you use this dataset, please cite the WizardCoder paper.
```
@misc{luo2023wizardcoder,
title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
author={Ziyang Luo and Can Xu and Pu Zhao and Qingfeng Sun and Xiubo Geng and Wenxiang Hu and Chongyang Tao and Jing Ma and Qingwei Lin and Daxin Jiang},
year={2023},
eprint={2306.08568},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider:
HKBU-NLP
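Background overview
Code-Evol-Instruct-OSS is an instruction-evolution dataset for research on code LLMs. It was generated with open-source LLMs (WizardLM and WizardCoder) from the self-instruct-starcoder seed set, contains 4308 samples, and is intended to improve model performance on code generation tasks such as the HumanEval benchmark.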