
huawei-noah/python_text2code

Source: Hugging Face. Updated: 2024-09-04. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/huawei-noah/python_text2code
Description:
The dataset was crawled from public repositories on GitHub before May 2021 and is intended for additional model training on Code Synthesis (i.e., Text-to-Code generation) in Python. Files were kept only if they were smaller than 1MB, compatible with Python3, averaged fewer than 100 characters per line, and contained fewer than 1,000 characters in every single line. Valid functions and their corresponding docstrings were extracted using AST parsing; each docstring was separated from the code and used as the problem description. Instances without a docstring were discarded. The final dataset contains 23,526,586 text-to-code pairs in Python.
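
The filtering and extraction steps described above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' pipeline: the function names `passes_file_filters` and `extract_pairs`, the output keys `text` and `code`, the reading of 1MB as 10**6 bytes, and the use of a successful `ast.parse` as the Python3 compatibility check are all assumptions made for the example. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast

# Thresholds taken from the description; 1MB is assumed to mean 10**6 bytes.
MAX_FILE_BYTES = 1_000_000
MAX_AVG_LINE_LEN = 100
MAX_LINE_LEN = 1_000


def passes_file_filters(source: str) -> bool:
    """Return True if a file meets the size and line-length criteria."""
    if len(source.encode("utf-8")) >= MAX_FILE_BYTES:
        return False
    lines = source.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if sum(lengths) / len(lengths) >= MAX_AVG_LINE_LEN:
        return False
    if max(lengths) >= MAX_LINE_LEN:
        return False
    try:
        # Assumption: "Python3 compatible" is approximated by parsing cleanly.
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def extract_pairs(source: str) -> list[dict[str, str]]:
    """Return docstring/code pairs for every function that has a docstring."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        docstring = ast.get_docstring(node)
        if not docstring:
            continue  # functions without a docstring are discarded
        # Remove the docstring statement so the code is separate from the text;
        # keep the function valid if the docstring was its only statement.
        node.body = node.body[1:] or [ast.Pass()]
        pairs.append({"text": docstring, "code": ast.unparse(node)})
    return pairs


example = (
    "def add(a, b):\n"
    '    """Return the sum of a and b."""\n'
    "    return a + b\n"
    "\n"
    "def no_doc(x):\n"
    "    return x\n"
)

if passes_file_filters(example):
    for pair in extract_pairs(example):
        print(pair["text"])
        print(pair["code"])
```

In this sketch only `add` yields a pair, since `no_doc` has no docstring. Note that `ast.unparse` regenerates the code from the syntax tree, so comments and original formatting are lost; a pipeline that needs the original source text would have to slice it by line numbers instead.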
Provider:
huawei-noah