
Vezora/Tested-22k-Python-Alpaca

Source: Hugging Face. Updated: 2023-12-26. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Vezora/Tested-22k-Python-Alpaca
Resource description:
---
license: apache-2.0
---

Contributors: Nicolas Mejia Petit

# Vezora's CodeTester Dataset

![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)

## Introduction

Today, on November 2, 2023, we are excited to release our internal Python dataset with 22,600 examples of code. These examples have been tested and verified as working. The dataset was created using a script we developed.

### Dataset Creation

- Our first script extracts Python code from the output section of Alpaca-formatted datasets. It tests each extracted piece of code, keeps it if it passes and removes it if it fails, then saves all the working code in a separate dataset.
- Our second script removes the non-working code from your Alpaca datasets and saves it to a "not working" JSON file. It keeps all the working examples, along with any non-Python examples, and saves them.
- WARNING: this script runs on your local computer, with multithreading so that it runs fast. If there is any malicious Python code in your dataset, it WILL run on your local computer, so either run the script in a VM or don't sift through shady datasets. You also need the relevant Python packages installed: mostly common ones that most users already have, plus some others such as tkinter, so that certain lines of code can be tested.
- (If you are struggling to convert your dataset to Alpaca format, give the first three questions of both datasets to ChatGPT or Bing and ask for a script that converts the dataset to the format you want. It might take one or two tries.)
- Creating this dataset involved open source datasets from various sources, including WizardLM's Evol datasets, CodeUp's 19k, sahil2801's Code Alpaca, Eric Hartford's Dolphin, and a selection of hand-prompted GPT-4 code questions. The resulting dataset was carefully deduplicated.
- We discovered that many of the open source datasets contained thousands of non-functional code examples, often plagued by module errors and other issues. Importantly, our script's approach is highly adaptable and could potentially be used to test code in other languages such as C++, C, SQL, and more.

### Usage Guidelines

We invested a significant amount of time in developing this script. If you intend to use it to extract functional code in your own projects or datasets, or plan to use our dataset, please include the following attribution in your model's or dataset's repository:

"Filtered Using Vezora's CodeTester"

## Motivation

We are releasing our internal tool thanks to OpenChat 3.5's recognition of its foundational model limitations, particularly in tasks related to code.

### Limitations of Foundational Models

Even when they write syntactically correct code, foundational models often lack access to up-to-date Python and API documentation. As a result, code generated by these models may contain errors stemming from outdated calls or methods.

## Building a Strong Python Code Model

If you aspire to build a robust Python code model, we recommend the following steps:

1. Pretrain with Mistral 7B on up-to-date Python and API documentation. (During our testing we found that even when a model writes syntactically correct code, it lacks up-to-date API calls and functions.)
2. Consider incorporating programming textbooks into your training.
3. Fine-tune your model with our dataset using SFT (Supervised Fine-Tuning).

In the future, we may also release our "not working" code dataset, allowing users to train with DPO (Direct Preference Optimization) so that a model is rewarded for functional code over non-functional code. With the second script provided, though, it would be fairly easy to build such a dataset yourself.

We hope this dataset serves as a valuable resource for the community and contributes to the improvement of code-related AI models.
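The filtering procedure described under Dataset Creation can be sketched roughly as follows. This is not Vezora's actual script, only an illustration under stated assumptions: that code appears in fenced `python` blocks inside the Alpaca `output` field, and that "passes" means the snippet exits with status 0 within a timeout. The names `extract_python`, `runs_cleanly`, and `split_dataset` are hypothetical, and the sketch runs sequentially rather than with multithreading.

```python
import re
import subprocess
import sys

# Assumption: code lives in fenced "python" (or "py") blocks inside the "output" field.
FENCE_RE = re.compile(r"```(?:python|py)\s*\n(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_python(output: str) -> list[str]:
    """Return the bodies of all fenced Python blocks found in an Alpaca "output" string."""
    return [body.strip() for body in FENCE_RE.findall(output)]


def runs_cleanly(code: str, timeout: float = 10.0) -> bool:
    """Assumed pass criterion: the snippet exits with status 0 before the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def split_dataset(records):
    """Partition Alpaca records into (kept, not_working).

    Records with no Python code are kept untouched, as described for the second script.
    """
    kept, not_working = [], []
    for record in records:
        snippets = extract_python(record.get("output", ""))
        # all() is vacuously true when a record contains no Python snippets.
        if all(runs_cleanly(snippet) for snippet in snippets):
            kept.append(record)
        else:
            not_working.append(record)
    return kept, not_working
```

A child process is not a sandbox: each snippet still executes with your user's permissions, so the warning above about running inside a VM applies to this sketch as well. Snippets that import packages missing from the local environment are classified as not working, which is why the required packages need to be installed beforehand.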
Why there are some references to 188k: we had used a script to count the examples in the dataset and had not realized that the script wasn't meant for Alpaca datasets, so it counted the examples wrong. Therefore, this is "only" 22k functioning Python code examples. However, we will soon release a better coding dataset that people will be even happier with, containing over 220,000 examples of code (only the Python code is tested, but it contains many other languages). I will also be releasing 13k examples of not-working code, for use in DPO datasets or RLHF.
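To make the counting issue concrete, here is a minimal sketch of counting examples in an Alpaca-formatted file. It assumes the common layout of a single top-level JSON list with one object per example. The source does not say how the original counting script worked, so the line-count comparison is only one illustrative way a count can go wrong, and `count_alpaca_examples` is a hypothetical helper name.

```python
import json


def count_alpaca_examples(text: str) -> int:
    """Count records in an Alpaca-style JSON document (a top-level list of objects)."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON list of records")
    return len(data)


# Two illustrative records, serialized the way a pretty-printed dataset file might be.
sample = json.dumps(
    [
        {"instruction": "Add two numbers.", "input": "", "output": "def add(a, b):\n    return a + b"},
        {"instruction": "Say hi.", "input": "", "output": "print('hi')"},
    ],
    indent=2,
)

print(count_alpaca_examples(sample))  # number of records
print(len(sample.splitlines()))       # number of lines, which is not the number of examples
```

Counting parsed records rather than lines or other textual units avoids inflating the total when each example spans several lines of the file.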

Provider:
Vezora
Summary of original information

Dataset contributors

  • Nicolas Mejia Petit