HKBU-NLP/Code-Evol-Instruct-OSS

Name: HKBU-NLP/Code-Evol-Instruct-OSS
Creator: HKBU-NLP
Published: 2023-11-09 14:05:04
License: bigcode-openrail-m

Source: Hugging Face. Updated 2023-11-09; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/HKBU-NLP/Code-Evol-Instruct-OSS

Resource description:

---
license: bigcode-openrail-m
language:
- en
size_categories:
- 1K<n<10K
---

# Code-Evol-Instruct-OSS

## Summary

Code-Evol-Instruct-OSS is a dataset generated with Code Evol-Instruct by prompting two open-source LLMs, WizardLM-13B-v1.2 and WizardCoder-34B-Python. The underlying process is explained in the paper [code-evol-instruct](https://arxiv.org/abs/2306.08568). This algorithm produced the well-known WizardCoder family of open-source code LLMs.

## Our approach

- We did not use any closed-source LLMs.
- Our seed dataset is sourced from [self-instruct-starcoder](https://huggingface.co/datasets/codeparrot/self-instruct-starcoder).
- We use WizardLM-13B-v1.2 to evolve the instructions over three rounds.
- The responses to each instruction are generated using WizardCoder-34B-Python.
- Samples that are excessively long or lack code responses are filtered out.
- The final dataset contains 4308 samples.

## Preliminary Experiments

We fine-tuned starcoderbase-3b on this dataset and achieved 28.7 pass@1 on HumanEval (greedy decoding), surpassing the original model by approximately 8 points.

## Citation

If you use this dataset, please cite the WizardCoder paper.

```
@misc{luo2023wizardcoder,
  title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
  author={Ziyang Luo and Can Xu and Pu Zhao and Qingfeng Sun and Xiubo Geng and Wenxiang Hu and Chongyang Tao and Jing Ma and Qingwei Lin and Daxin Jiang},
  year={2023},
  eprint={2306.08568},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Provider:

HKBU-NLP

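Summary of original information

Code-Evol-Instruct-OSS

Overview

Code-Evol-Instruct-OSS is a dataset generated by prompting the open-source large language models (LLMs) WizardLM-13B-v1.2 and WizardCoder-34B-Python. The generation process is explained in detail in the paper code-evol-instruct. This algorithm gave rise to the well-known WizardCoder family of open-source code LLMs.

Our approach

No closed-source LLMs were used.
The seed dataset is sourced from self-instruct-starcoder.
WizardLM-13B-v1.2 is used to evolve the instructions over three rounds.
The response to each instruction is generated by WizardCoder-34B-Python.
Samples that are excessively long or lack a code response are filtered out.
The final dataset contains 4308 samples.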
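
The steps listed above (three evolution rounds, response generation, filtering) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the prompt text, the `instruction`/`output` field names, the 4000-character cutoff, the "contains a fenced code block" check, and keeping only the final-round instruction are all hypothetical choices. The two models are passed in as plain callables so that the sketch does not depend on any particular inference library.

```python
from typing import Callable, Dict, List

# Hypothetical cutoff: the dataset card does not say what "excessively long" means.
MAX_RESPONSE_CHARS = 4000

# Hypothetical prompt: the actual Code Evol-Instruct prompts are in the WizardCoder paper.
EVOLVE_TEMPLATE = (
    "Please increase the difficulty of the given programming question a bit.\n\n"
    "Question:\n{instruction}"
)

# Markdown code fence marker, built this way to avoid writing a literal fence here.
FENCE = "`" * 3


def evolve(instruction: str, evolver: Callable[[str], str], rounds: int = 3) -> str:
    """Rewrite an instruction `rounds` times, feeding each output into the next round."""
    for _ in range(rounds):
        instruction = evolver(EVOLVE_TEMPLATE.format(instruction=instruction))
    return instruction


def keep(sample: Dict[str, str], max_chars: int = MAX_RESPONSE_CHARS) -> bool:
    """Keep a sample only if its response is short enough and contains fenced code.

    Assumption: "has a code response" is approximated by an opening and a
    closing Markdown fence appearing in the response text.
    """
    response = sample["output"]
    return len(response) <= max_chars and response.count(FENCE) >= 2


def build_dataset(
    seeds: List[str],
    evolver: Callable[[str], str],    # stands in for WizardLM-13B-v1.2
    responder: Callable[[str], str],  # stands in for WizardCoder-34B-Python
    rounds: int = 3,
    max_chars: int = MAX_RESPONSE_CHARS,
) -> List[Dict[str, str]]:
    """Evolve each seed, generate a response, and drop samples that fail the filter."""
    samples = []
    for seed in seeds:
        instruction = evolve(seed, evolver, rounds)
        sample = {"instruction": instruction, "output": responder(instruction)}
        if keep(sample, max_chars):
            samples.append(sample)
    return samples
```

Passing the models as callables keeps the control flow testable with stubs; in practice each callable would wrap whatever generation backend is used.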

Preliminary experiments

We fine-tuned starcoderbase-3b on this dataset and reached 28.7 pass@1 on HumanEval (greedy decoding), about 8 points higher than the original model.
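
For readers unfamiliar with the metric, the sketch below shows how a pass@1 score under greedy decoding is typically computed, along with the general unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), commonly used with HumanEval. The function names are hypothetical and the inputs are illustrative; the per-problem results behind the reported 28.7 are not given in the dataset card.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def greedy_pass_at_1(solved: List[bool]) -> float:
    """Greedy decoding yields one sample per problem, so pass@1 is the
    percentage of problems whose single completion passes all tests."""
    if not solved:
        raise ValueError("need at least one problem")
    return 100.0 * sum(solved) / len(solved)
```

As a purely illustrative check, HumanEval has 164 problems, so solving 47 of them would give `greedy_pass_at_1([True] * 47 + [False] * 117)`, which rounds to 28.7.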

Citation

If you use this dataset, please cite the WizardCoder paper.

@misc{luo2023wizardcoder,
  title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
  author={Ziyang Luo and Can Xu and Pu Zhao and Qingfeng Sun and Xiubo Geng and Wenxiang Hu and Chongyang Tao and Jing Ma and Qingwei Lin and Daxin Jiang},
  year={2023},
  eprint={2306.08568},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

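Collected summary

Dataset introduction

Background and challenges

Background overview

Code-Evol-Instruct-OSS is an instruction-evolution dataset for research on code large language models. It was generated by open-source LLMs (WizardLM and WizardCoder) from the self-instruct-starcoder seed set, contains 4308 samples, and is intended to improve model performance on code generation tasks such as the HumanEval benchmark.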

The content above was collected and summarized by 遇见数据集.
