codeparrot/github-code

Name: codeparrot/github-code
Creator: codeparrot
Published: 2022-10-20 15:01:14
License: No description available

Hugging Face | Updated 2022-10-20 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/codeparrot/github-code

Resource description:

---
annotations_creators: []
language_creators:
- crowdsourced
- expert-generated
language:
- code
license:
- other
multilinguality:
- multilingual
pretty_name: github-code
size_categories:
- unknown
source_datasets: []
task_categories:
- text-generation
task_ids:
- language-modeling
---

# GitHub Code Dataset

## Dataset Description

The GitHub Code dataset consists of 115M code files from GitHub in 30 programming languages with over 60 extensions, totaling 1 TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.

### How to use it

The GitHub Code dataset is very large, so for most use cases it is recommended to use the streaming API of `datasets`. You can load and iterate through the dataset with the following two lines of code:

```python
from datasets import load_dataset

ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
print(next(iter(ds)))

#OUTPUT:
{
 'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
 'repo_name': 'MirekSz/webpack-es6-ts',
 'path': 'app/mods/mod190.js',
 'language': 'JavaScript',
 'license': 'isc',
 'size': 73
}
```

Besides the code, repo name, and path, each record also includes the programming language, the license, and the size of the file.

You can also filter the dataset for any subset of the 30 included languages (see the full list below). Just pass the languages as a list. For example, if your dream is to build a Codex model for Dockerfiles, use the following configuration:

```python
ds = load_dataset("codeparrot/github-code", streaming=True, split="train", languages=["Dockerfile"])
print(next(iter(ds))["code"])

#OUTPUT:
"""\
FROM rockyluke/ubuntu:precise

ENV DEBIAN_FRONTEND="noninteractive" \
    TZ="Europe/Amsterdam"
...
"""
```

The license of the origin repository is also available for each file, so you can filter for licenses in the same way as for languages:

```python
from collections import Counter

ds = load_dataset("codeparrot/github-code", streaming=True, split="train", licenses=["mit", "isc"])

licenses = []
# take() is a method of the streaming dataset itself, not of its iterator
for element in ds.take(10_000):
    licenses.append(element["license"])
print(Counter(licenses))

#OUTPUT:
Counter({'mit': 9896, 'isc': 104})
```

Naturally, you can also download the full dataset. Note that this will download ~300GB of compressed text data and the uncompressed dataset will take up ~1TB of storage:

```python
ds = load_dataset("codeparrot/github-code", split="train")
```

## Data Structure

### Data Instances

```python
{
 'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
 'repo_name': 'MirekSz/webpack-es6-ts',
 'path': 'app/mods/mod190.js',
 'language': 'JavaScript',
 'license': 'isc',
 'size': 73
}
```

### Data Fields

|Field|Type|Description|
|---|---|---|
|code|string|content of source file|
|repo_name|string|name of the GitHub repository|
|path|string|path of file in GitHub repository|
|language|string|programming language as inferred by extension|
|license|string|license of GitHub repository|
|size|int|size of source file in bytes|

### Data Splits

The dataset only contains a train split.

## Languages

The dataset contains 30 programming languages with over 60 extensions:

```python
{
    "Assembly": [".asm"],
    "Batchfile": [".bat", ".cmd"],
    "C": [".c", ".h"],
    "C#": [".cs"],
    "C++": [".cpp", ".hpp", ".c++", ".h++", ".cc", ".hh", ".C", ".H"],
    "CMake": [".cmake"],
    "CSS": [".css"],
    "Dockerfile": [".dockerfile", "Dockerfile"],
    "FORTRAN": [".f90", ".f", ".f03", ".f08", ".f77", ".f95", ".for", ".fpp"],
    "GO": [".go"],
    "Haskell": [".hs"],
    "HTML": [".html"],
    "Java": [".java"],
    "JavaScript": [".js"],
    "Julia": [".jl"],
    "Lua": [".lua"],
    "Makefile": ["Makefile"],
    "Markdown": [".md", ".markdown"],
    "PHP": [".php", ".php3", ".php4", ".php5", ".phps", ".phpt"],
    "Perl": [".pl", ".pm", ".pod", ".perl"],
    "PowerShell": [".ps1", ".psd1", ".psm1"],
    "Python": [".py"],
    "Ruby": [".rb"],
    "Rust": [".rs"],
    "SQL": [".sql"],
    "Scala": [".scala"],
    "Shell": [".sh", ".bash", ".command", ".zsh"],
    "TypeScript": [".ts", ".tsx"],
    "TeX": [".tex"],
    "Visual Basic": [".vb"]
}
```

## Licenses

Each example is also annotated with the license of the associated repository. There are 15 licenses in total:

```python
[
  'mit',
  'apache-2.0',
  'gpl-3.0',
  'gpl-2.0',
  'bsd-3-clause',
  'agpl-3.0',
  'lgpl-3.0',
  'lgpl-2.1',
  'bsd-2-clause',
  'cc0-1.0',
  'epl-1.0',
  'mpl-2.0',
  'unlicense',
  'isc',
  'artistic-2.0'
]
```

## Dataset Statistics

The dataset contains 115M files and the sum of all the source code file sizes is 873 GB (note that the size of the dataset is larger due to the extra fields). A breakdown per language is given in the plot and table below:

![dataset-statistics](https://huggingface.co/datasets/codeparrot/github-code/resolve/main/github-code-stats-alpha.png)

| Index | Language     | File Count | Size (GB) |
|------:|:-------------|-----------:|----------:|
|     0 | Java         |   19548190 |    107.70 |
|     1 | C            |   14143113 |    183.83 |
|     2 | JavaScript   |   11839883 |     87.82 |
|     3 | HTML         |   11178557 |    118.12 |
|     4 | PHP          |   11177610 |     61.41 |
|     5 | Markdown     |    8464626 |     23.09 |
|     6 | C++          |    7380520 |     87.73 |
|     7 | Python       |    7226626 |     52.03 |
|     8 | C#           |    6811652 |     36.83 |
|     9 | Ruby         |    4473331 |     10.95 |
|    10 | GO           |    2265436 |     19.28 |
|    11 | TypeScript   |    1940406 |     24.59 |
|    12 | CSS          |    1734406 |     22.67 |
|    13 | Shell        |    1385648 |      3.01 |
|    14 | Scala        |     835755 |      3.87 |
|    15 | Makefile     |     679430 |      2.92 |
|    16 | SQL          |     656671 |      5.67 |
|    17 | Lua          |     578554 |      2.81 |
|    18 | Perl         |     497949 |      4.70 |
|    19 | Dockerfile   |     366505 |      0.71 |
|    20 | Haskell      |     340623 |      1.85 |
|    21 | Rust         |     322431 |      2.68 |
|    22 | TeX          |     251015 |      2.15 |
|    23 | Batchfile    |     236945 |      0.70 |
|    24 | CMake        |     175282 |      0.54 |
|    25 | Visual Basic |     155652 |      1.91 |
|    26 | FORTRAN      |     142038 |      1.62 |
|    27 | PowerShell   |     136846 |      0.69 |
|    28 | Assembly     |      82905 |      0.78 |
|    29 | Julia        |      58317 |      0.29 |

## Dataset Creation

The dataset was created in two steps:

1. Files with the extensions given in the list above were retrieved from the GitHub dataset on BigQuery (full query [here](https://huggingface.co/datasets/codeparrot/github-code/blob/main/query.sql)). The query was executed on _Mar 16, 2022, 6:23:39 PM UTC+1_.
2. Files with lines longer than 1000 characters and duplicates (exact duplicates ignoring whitespace) were dropped (full preprocessing script [here](https://huggingface.co/datasets/codeparrot/github-code/blob/main/github_preprocessing.py)).

## Considerations for Using the Data

The dataset consists of source code from a wide range of repositories. As such, it can potentially include harmful or biased code as well as sensitive information like passwords or usernames.

## Releases

You can load any older version of the dataset with the `revision` argument:

```python
ds = load_dataset("codeparrot/github-code", revision="v1.0")
```

### v1.0

- Initial release of dataset
- The query was executed on _Feb 14, 2022, 12:03:16 PM UTC+1_

### v1.1

- Fix missing Scala/TypeScript
- Fix deduplication issue with inconsistent Python `hash`
- The query was executed on _Mar 16, 2022, 6:23:39 PM UTC+1_


Provider:

codeparrot

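Summary of the original information

GitHub Code Dataset overview

Dataset description

The GitHub Code dataset contains 115M code files from GitHub covering 30 programming languages, 1 TB of data in total. It is derived from the public GitHub dataset on Google BigQuery.

Data structure

Data instance

{'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n", 'repo_name': 'MirekSz/webpack-es6-ts', 'path': 'app/mods/mod190.js', 'language': 'JavaScript', 'license': 'isc', 'size': 73}

Data fields

Field	Type	Description
code	string	content of the source file
repo_name	string	name of the GitHub repository
path	string	path of the file in the GitHub repository
language	string	programming language as inferred from the extension
license	string	license of the GitHub repository
size	int	size of the source file in bytes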
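
The field table above can be mirrored in a small typed record. The following is a minimal sketch, not part of the dataset's own tooling: `GitHubCodeRecord` and `check_record` are hypothetical names, and the check that `size` equals the UTF-8 byte length of `code` is an assumption that happens to hold for the sample instance shown earlier.

```python
from typing import Any, Dict, List, TypedDict


class GitHubCodeRecord(TypedDict):
    """Hypothetical typed view of one row; field names come from the table above."""
    code: str
    repo_name: str
    path: str
    language: str
    license: str
    size: int


EXPECTED_TYPES = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a record (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    # Assumption: `size` counts the UTF-8 bytes of `code`.
    if not problems and record["size"] != len(record["code"].encode("utf-8")):
        problems.append("size does not match UTF-8 byte length of code")
    return problems


sample: GitHubCodeRecord = {
    "code": "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
    "repo_name": "MirekSz/webpack-es6-ts",
    "path": "app/mods/mod190.js",
    "language": "JavaScript",
    "license": "isc",
    "size": 73,
}
print(check_record(sample))  # []
```

A check like this can be run on the first few streamed records before committing to a long training or analysis job.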

Data splits

The dataset contains only a train split.

Programming languages

The dataset contains 30 programming languages with over 60 extensions:

Assembly: [".asm"]
Batchfile: [".bat", ".cmd"]
C: [".c", ".h"]
C#: [".cs"]
C++: [".cpp", ".hpp", ".c++", ".h++", ".cc", ".hh", ".C", ".H"]
CMake: [".cmake"]
CSS: [".css"]
Dockerfile: [".dockerfile", "Dockerfile"]
FORTRAN: [".f90", ".f", ".f03", ".f08", ".f77", ".f95", ".for", ".fpp"]
GO: [".go"]
Haskell: [".hs"]
HTML: [".html"]
Java: [".java"]
JavaScript: [".js"]
Julia: [".jl"]
Lua: [".lua"]
Makefile: ["Makefile"]
Markdown: [".md", ".markdown"]
PHP: [".php", ".php3", ".php4", ".php5", ".phps", ".phpt"]
Perl: [".pl", ".pm", ".pod", ".perl"]
PowerShell: [".ps1", ".psd1", ".psm1"]
Python: [".py"]
Ruby: [".rb"]
Rust: [".rs"]
SQL: [".sql"]
Scala: [".scala"]
Shell: [".sh", ".bash", ".command", ".zsh"]
TypeScript: [".ts", ".tsx"]
TeX: [".tex"]
Visual Basic: [".vb"]
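
The mapping above can be inverted to look up a language from a file path. The sketch below is only an illustration of extension-based inference under stated assumptions: it uses a small subset of the mapping, `infer_language` is a hypothetical helper, and the matching rules (case-sensitive suffixes, bare file names such as `Dockerfile`) are guesses at reasonable behavior rather than a copy of the actual BigQuery query.

```python
import posixpath
from typing import Dict, List, Optional

# Subset of the language-to-extension mapping listed above.
EXTENSIONS: Dict[str, List[str]] = {
    "C": [".c", ".h"],
    "C++": [".cpp", ".hpp", ".c++", ".h++", ".cc", ".hh", ".C", ".H"],
    "Dockerfile": [".dockerfile", "Dockerfile"],
    "JavaScript": [".js"],
    "Makefile": ["Makefile"],
    "Python": [".py"],
}

# Invert the mapping: suffix (or bare file name) -> language.
SUFFIX_TO_LANGUAGE: Dict[str, str] = {
    suffix: language
    for language, suffixes in EXTENSIONS.items()
    for suffix in suffixes
}


def infer_language(path: str) -> Optional[str]:
    """Guess the language of a repository path, or None if unknown."""
    name = posixpath.basename(path)
    # Entries such as "Dockerfile" and "Makefile" are whole file names.
    if name in SUFFIX_TO_LANGUAGE:
        return SUFFIX_TO_LANGUAGE[name]
    # Lookup is case-sensitive so that ".C" (C++) stays distinct from ".c" (C).
    _, extension = posixpath.splitext(name)
    return SUFFIX_TO_LANGUAGE.get(extension)


print(infer_language("app/mods/mod190.js"))  # JavaScript
print(infer_language("src/main.C"))          # C++
print(infer_language("docker/Dockerfile"))   # Dockerfile
```

Keeping the lookup case-sensitive matters here because the list assigns `.c` and `.h` to C but `.C` and `.H` to C++.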

Licenses

Each example is annotated with the license of its repository. There are 15 licenses in total:

mit
apache-2.0
gpl-3.0
gpl-2.0
bsd-3-clause
agpl-3.0
lgpl-3.0
lgpl-2.1
bsd-2-clause
cc0-1.0
epl-1.0
mpl-2.0
unlicense
isc
artistic-2.0
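
Because every record carries one of the license identifiers above, records can also be filtered on the client side. The following is a minimal sketch: the `PERMISSIVE` grouping is this example's own choice and is not defined by the dataset, `keep_licenses` is a hypothetical helper, and the records are made-up placeholders.

```python
from typing import Dict, Iterable, Iterator, Set

# Assumption: this grouping is the example's own, not part of the dataset.
PERMISSIVE: Set[str] = {
    "mit",
    "apache-2.0",
    "bsd-3-clause",
    "bsd-2-clause",
    "isc",
    "unlicense",
    "cc0-1.0",
}


def keep_licenses(records: Iterable[Dict[str, str]], allowed: Set[str]) -> Iterator[Dict[str, str]]:
    """Lazily yield only the records whose license is in `allowed`."""
    for record in records:
        if record["license"] in allowed:
            yield record


# Placeholder records for illustration only.
records = [
    {"path": "a.py", "license": "mit"},
    {"path": "b.c", "license": "gpl-3.0"},
    {"path": "c.js", "license": "isc"},
]
kept = [record["path"] for record in keep_licenses(records, PERMISSIVE)]
print(kept)  # ['a.py', 'c.js']
```

The generator keeps memory use constant, which suits streamed data. With a streaming dataset from `datasets`, the same predicate could likely be applied through its `filter` method instead.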

Dataset statistics

The dataset contains 115M files, and the total size of all source code files is 873 GB. The breakdown by language is:

Language	File count	Size (GB)
Java	19548190	107.70
C	14143113	183.83
JavaScript	11839883	87.82
HTML	11178557	118.12
PHP	11177610	61.41
Markdown	8464626	23.09
C++	7380520	87.73
Python	7226626	52.03
C#	6811652	36.83
Ruby	4473331	10.95
GO	2265436	19.28
TypeScript	1940406	24.59
CSS	1734406	22.67
Shell	1385648	3.01
Scala	835755	3.87
Makefile	679430	2.92
SQL	656671	5.67
Lua	578554	2.81
Perl	497949	4.70
Dockerfile	366505	0.71
Haskell	340623	1.85
Rust	322431	2.68
TeX	251015	2.15
Batchfile	236945	0.70
CMake	175282	0.54
Visual Basic	155652	1.91
FORTRAN	142038	1.62
PowerShell	136846	0.69
Assembly	82905	0.78
Julia	58317	0.29
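
The two numeric columns can be combined into a mean file size per language. This is a small worked sketch over five rows copied from the table above; it assumes that 1 GB means 10**9 bytes and 1 KB means 1000 bytes, which the source does not state.

```python
from typing import Dict, List, Tuple

# (language, file count, size in GB), copied from the table above.
ROWS: List[Tuple[str, int, float]] = [
    ("Java", 19548190, 107.70),
    ("C", 14143113, 183.83),
    ("Python", 7226626, 52.03),
    ("Shell", 1385648, 3.01),
    ("Julia", 58317, 0.29),
]


def mean_file_size_kb(file_count: int, size_gb: float) -> float:
    """Mean file size in KB, assuming decimal units (1 GB = 10**9 bytes)."""
    return size_gb * 1e9 / file_count / 1e3


means: Dict[str, float] = {
    language: mean_file_size_kb(count, size_gb)
    for language, count, size_gb in ROWS
}

# Print languages from largest to smallest mean file size.
for language, kilobytes in sorted(means.items(), key=lambda item: item[1], reverse=True):
    print(f"{language:<8}{kilobytes:6.1f} KB")
```

Among these rows, C files come out largest on average and Shell files smallest, which is consistent with C having fewer files than Java but a larger total size.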

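Dataset creation

The dataset was created in two steps:

1. Files with the extensions listed above were retrieved from the GitHub dataset on BigQuery.
2. Files with lines longer than 1000 characters and duplicate files (exact duplicates ignoring whitespace) were dropped.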
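
The second step can be sketched as a single streaming pass. This is an illustration under stated assumptions, not the dataset's actual preprocessing script: the helper names are hypothetical, SHA-256 is this example's choice of digest, and the records are placeholders.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Set

MAX_LINE_LENGTH = 1000  # threshold stated in the text


def has_long_line(code: str, limit: int = MAX_LINE_LENGTH) -> bool:
    """True if any line of the file exceeds `limit` characters."""
    return any(len(line) > limit for line in code.splitlines())


def whitespace_free_digest(code: str) -> str:
    """Digest of the file content with all whitespace removed."""
    stripped = "".join(code.split())
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def clean(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Drop files with over-long lines and whitespace-insensitive duplicates."""
    seen: Set[str] = set()
    for record in records:
        code = record["code"]
        if has_long_line(code):
            continue
        digest = whitespace_free_digest(code)
        if digest in seen:
            continue  # same content as an earlier file, ignoring whitespace
        seen.add(digest)
        yield record


# Placeholder records for illustration only.
records = [
    {"path": "a.py", "code": "x = 1\n"},
    {"path": "b.py", "code": "x=1"},                        # duplicate of a.py
    {"path": "c.js", "code": "var s = '" + "a" * 2000 + "';\n"},  # long line
    {"path": "d.py", "code": "y = 2\n"},
]
print([record["path"] for record in clean(records)])  # ['a.py', 'd.py']
```

A `hashlib` digest is used rather than the built-in `hash()` because `hash()` of a string is randomized per process by default, so its values cannot be compared across workers or runs. The v1.1 release notes mention a deduplication issue caused by inconsistent Python `hash`.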

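Considerations for using the data

The dataset consists of source code from a wide range of repositories, so it may contain harmful or biased code as well as sensitive information such as passwords or usernames.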
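
One way to act on this warning is to flag lines that look like hard-coded credentials before using a file. The sketch below is a deliberately simple heuristic written for this example: the pattern and the `suspicious_lines` helper are assumptions, they will miss many real secrets, and they will also flag harmless lines.

```python
import re
from typing import List

# Assumption: a simple illustrative pattern, not a complete secret detector.
# It matches assignments such as  password = "..."  or  api_key: '...'.
CREDENTIAL_PATTERN = re.compile(
    r"(password|passwd|pwd|secret|api_key|token)\s*[:=]\s*['\"][^'\"]{4,}['\"]",
    re.IGNORECASE,
)


def suspicious_lines(code: str) -> List[int]:
    """Return 1-based line numbers that match the credential pattern."""
    return [
        number
        for number, line in enumerate(code.splitlines(), start=1)
        if CREDENTIAL_PATTERN.search(line)
    ]


sample = 'user = "bob"\ndb_password = "hunter22"\nx = 1\n'
print(suspicious_lines(sample))  # [2]
```

Flagged files could be dropped or reviewed manually, depending on how much data loss is acceptable for the task.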

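Collected summary

Dataset introduction

Construction

The GitHub Code dataset is built from public code files on GitHub in two steps. First, files with specific extensions are retrieved from the GitHub dataset on Google BigQuery. Second, files containing lines longer than 1000 characters and exact duplicates (ignoring whitespace) are removed to improve data quality.

Characteristics

The dataset contains 115M code files in 30 programming languages with over 60 file extensions, about 1 TB of data in total. Besides the code itself, each record includes metadata: repository name, file path, programming language, license, and file size. Records can be filtered by programming language and by license.

Usage

Users can load the dataset through the streaming API of the Hugging Face `datasets` library and iterate over it without downloading everything. They can filter for specific programming languages or licenses, or download the full dataset. Note that the full download is about 300 GB compressed and takes about 1 TB of storage once uncompressed.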

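Background and challenges

Background

The GitHub Code dataset, created by the CodeParrot team, gathers 115M code files from GitHub in 30 programming languages with over 60 extensions, about 1 TB of data in total. It is built from the public GitHub dataset on Google BigQuery and is intended as a source-code corpus for training code language models. Since its release in 2022, it has been used in research on programming language processing, particularly code generation and code understanding.

Current challenges

During construction, the main challenge was processing code at this scale efficiently, including cleaning, deduplication, and handling files with very long lines. Applying the dataset to tasks such as code classification and code generation raises further challenges, such as ensuring that models generalize and handling potentially harmful or sensitive content in the code.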

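Common scenarios

Typical use case

In computer science and software engineering, the typical use of the GitHub Code dataset is building and training code generation models. Its large volume of code in many programming languages lets researchers develop models that understand and generate code in multiple languages, supporting code generation, code completion, and code style imitation.

Academic problems addressed

The dataset addresses the need for large-scale code data in academic research and provides an empirical basis for studying topics such as code quality, programming language features, and software evolution. Analyzing its code samples helps researchers understand how programming languages are used in practice.

Derived work

Work building on the GitHub Code dataset in academia and industry includes, among others, code search tools, code recommendation systems, and automated code repair tools. Such work supports the automation of software development and offers new tools for programming education.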

The above content was collected and summarized by 遇见数据集.
