
HKBU-NLP/Code-Evol-Instruct-OSS

Source: Hugging Face | Updated: 2023-11-09 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/HKBU-NLP/Code-Evol-Instruct-OSS
Resource description:
---
license: bigcode-openrail-m
language:
- en
size_categories:
- 1K<n<10K
---

# Code-Evol-Instruct-OSS

## Summary

Code-Evol-Instruct-OSS is a dataset generated with Code Evol-Instruct by prompting two open-source LLMs, WizardLM-13B-v1.2 and WizardCoder-34B-Python. The underlying process is explained in the paper [code-evol-instruct](https://arxiv.org/abs/2306.08568). This algorithm gave rise to the well-known open-source code LLMs of the WizardCoder family.

## Our approach

- We did not use any closed-source LLMs.
- Our seed dataset is sourced from [self-instruct-starcoder](https://huggingface.co/datasets/codeparrot/self-instruct-starcoder).
- We use WizardLM-13B-v1.2 to evolve the instructions over three rounds.
- The responses to each instruction are generated using WizardCoder-34B-Python.
- Samples that are excessively long or lack code responses are filtered out.
- The final dataset contains 4308 samples.

## Preliminary Experiments

We fine-tuned starcoderbase-3b on this dataset and reached 28.7 pass@1 on HumanEval (greedy decoding), surpassing the original model by approximately 8 points.

## Citation

If you use this dataset, please cite the WizardCoder paper.

```
@misc{luo2023wizardcoder,
  title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
  author={Ziyang Luo and Can Xu and Pu Zhao and Qingfeng Sun and Xiubo Geng and Wenxiang Hu and Chongyang Tao and Jing Ma and Qingwei Lin and Daxin Jiang},
  year={2023},
  eprint={2306.08568},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
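The filtering step mentioned in the description (dropping samples that are excessively long or that lack a code response) could be sketched roughly as follows. This is only an illustration: the length threshold `MAX_CHARS`, the use of a Markdown fence as the test for "contains code", and the field names `instruction` and `response` are all assumptions, since the dataset card does not state the actual criteria or schema.

```python
import re

# Hypothetical threshold: the dataset card does not state the real length limit.
MAX_CHARS = 4000

# Assumed proxy for "the response contains code": a Markdown fenced block.
CODE_FENCE = re.compile(r"```.*?```", re.DOTALL)


def keep_sample(instruction: str, response: str, max_chars: int = MAX_CHARS) -> bool:
    """Return True if the sample is short enough and its response contains code."""
    if len(instruction) + len(response) > max_chars:
        return False
    return CODE_FENCE.search(response) is not None


def filter_samples(samples, max_chars: int = MAX_CHARS):
    """Keep only samples that pass keep_sample.

    Each sample is a dict with hypothetical keys "instruction" and "response".
    """
    return [
        s for s in samples
        if keep_sample(s["instruction"], s["response"], max_chars)
    ]
```

A character count is used here for simplicity; a token count from the model's tokenizer would be an equally plausible choice.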
Provided by:
HKBU-NLP
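Original information summary

Code-Evol-Instruct-OSS

Overview

Code-Evol-Instruct-OSS is a dataset generated by prompting the open-source large language models (LLMs) WizardLM-13B-v1.2 and WizardCoder-34B-Python. The generation process is explained in detail in the code-evol-instruct paper. This algorithm gave rise to the well-known open-source code LLMs of the WizardCoder family.

Our approach

  • No closed-source LLMs were used.
  • The seed dataset comes from self-instruct-starcoder.
  • WizardLM-13B-v1.2 is used to evolve the instructions over three rounds.
  • The response to each instruction is generated by WizardCoder-34B-Python.
  • Samples that are too long or lack a code response are filtered out.
  • The final dataset contains 4308 samples.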
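
The evolve-then-respond flow listed above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: `evolve` and `respond` are hypothetical callables standing in for calls to WizardLM-13B-v1.2 and WizardCoder-34B-Python, and the card does not describe the prompts, the decoding settings, or whether instructions from intermediate rounds are retained. The function therefore returns every round so the caller can decide.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for model calls; the real prompts and settings are not
# given in the dataset card.
Evolver = Callable[[str], str]
Responder = Callable[[str], str]


def evolve_rounds(seeds: List[str], evolve: Evolver, rounds: int = 3) -> List[List[str]]:
    """Apply `evolve` repeatedly; element i holds the instructions after round i + 1."""
    history: List[List[str]] = []
    current = list(seeds)
    for _ in range(rounds):
        current = [evolve(instruction) for instruction in current]
        history.append(current)
    return history


def build_samples(instructions: List[str], respond: Responder) -> List[Dict[str, str]]:
    """Pair each instruction with a generated response (key names are assumed)."""
    return [{"instruction": i, "response": respond(i)} for i in instructions]
```

With toy stubs such as `evolve=lambda s: s + " (harder)"`, three rounds turn one seed into three progressively rewritten instructions, which makes the data flow easy to inspect before any real model is plugged in.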

Preliminary experiments

We fine-tuned starcoderbase-3b on this dataset and reached 28.7 pass@1 on HumanEval (greedy decoding), about 8 points higher than the original model.
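
To make the metric concrete, here is a small sketch of how a greedy pass@1 score is typically computed. It uses the commonly cited unbiased pass@k estimator, which for a single greedy completion per problem (n = k = 1) reduces to the fraction of problems solved. The function names are hypothetical, and the 47-of-164 figure in the usage note is an illustration only; the card does not report raw counts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def greedy_pass_at_1(results) -> float:
    """Percentage of problems solved, given one boolean per problem.

    Each boolean says whether the single greedy completion passed the tests.
    """
    per_problem = [pass_at_k(1, int(ok), 1) for ok in results]
    return 100.0 * sum(per_problem) / len(per_problem)
```

For example, HumanEval has 164 problems, so 47 passing problems would give `greedy_pass_at_1([True] * 47 + [False] * 117)`, which rounds to 28.7.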

Citation

If you use this dataset, please cite the WizardCoder paper.

@misc{luo2023wizardcoder,
  title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
  author={Ziyang Luo and Can Xu and Pu Zhao and Qingfeng Sun and Xiubo Geng and Wenxiang Hu and Chongyang Tao and Jing Ma and Qingwei Lin and Daxin Jiang},
  year={2023},
  eprint={2306.08568},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

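Collected summary
Dataset introduction
Background and challenges
Background overview
Code-Evol-Instruct-OSS is an instruction-evolution dataset for research on code large language models. It was generated by open-source LLMs (WizardLM and WizardCoder) from the self-instruct-starcoder seed set and contains 4308 samples. Its aim is to improve model performance on code generation tasks, as measured by benchmarks such as HumanEval.
The content above was collected and summarized by 遇见数据集.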