github-code
收藏魔搭社区2026-05-07 更新2024-06-08 收录
下载链接:
https://modelscope.cn/datasets/swift/github-code
下载链接
链接失效反馈官方服务:
资源简介:
# GitHub Code Dataset
## Dataset Description
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions totaling in 1TB of data. The dataset was created from the public GitHub dataset on Google BiqQuery.
### How to use it
The GitHub Code dataset is a very large dataset so for most use cases it is recommended to make use of the streaming API of `datasets`. You can load and iterate through the dataset with the following two lines of code:
```python
from datasets import load_dataset
ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
print(next(iter(ds)))
#OUTPUT:
{
'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
'repo_name': 'MirekSz/webpack-es6-ts',
'path': 'app/mods/mod190.js',
'language': 'JavaScript',
'license': 'isc',
'size': 73
}
```
You can see that besides the code, repo name, and path also the programming language, license, and the size of the file are part of the dataset. You can also filter the dataset for any subset of the 30 included languages (see the full list below) in the dataset. Just pass the list of languages as a list. E.g. if your dream is to build a Codex model for Dockerfiles use the following configuration:
```python
ds = load_dataset("codeparrot/github-code", streaming=True, split="train", languages=["Dockerfile"])
print(next(iter(ds))["code"])
#OUTPUT:
"""\
FROM rockyluke/ubuntu:precise
ENV DEBIAN_FRONTEND="noninteractive" \
TZ="Europe/Amsterdam"
...
"""
```
We also have access to the license of the origin repo of a file so we can filter for licenses in the same way we filtered for languages:
```python
ds = load_dataset("codeparrot/github-code", streaming=True, split="train", licenses=["mit", "isc"])
licenses = []
for element in iter(ds).take(10_000):
licenses.append(element["license"])
print(Counter(licenses))
#OUTPUT:
Counter({'mit': 9896, 'isc': 104})
```
Naturally, you can also download the full dataset. Note that this will download ~300GB compressed text data and the uncompressed dataset will take up ~1TB of storage:
```python
ds = load_dataset("codeparrot/github-code", split="train")
```
## Data Structure
### Data Instances
```python
{
'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
'repo_name': 'MirekSz/webpack-es6-ts',
'path': 'app/mods/mod190.js',
'language': 'JavaScript',
'license': 'isc',
'size': 73
}
```
### Data Fields
|Field|Type|Description|
|---|---|---|
|code|string|content of source file|
|repo_name|string|name of the GitHub repository|
|path|string|path of file in GitHub repository|
|language|string|programming language as inferred by extension|
|license|string|license of GitHub repository|
|size|int|size of source file in bytes|
### Data Splits
The dataset only contains a train split.
## Languages
The dataset contains 30 programming languages with over 60 extensions:
```python
{
"Assembly": [".asm"],
"Batchfile": [".bat", ".cmd"],
"C": [".c", ".h"],
"C#": [".cs"],
"C++": [".cpp", ".hpp", ".c++", ".h++", ".cc", ".hh", ".C", ".H"],
"CMake": [".cmake"],
"CSS": [".css"],
"Dockerfile": [".dockerfile", "Dockerfile"],
"FORTRAN": ['.f90', '.f', '.f03', '.f08', '.f77', '.f95', '.for', '.fpp'],
"GO": [".go"],
"Haskell": [".hs"],
"HTML":[".html"],
"Java": [".java"],
"JavaScript": [".js"],
"Julia": [".jl"],
"Lua": [".lua"],
"Makefile": ["Makefile"],
"Markdown": [".md", ".markdown"],
"PHP": [".php", ".php3", ".php4", ".php5", ".phps", ".phpt"],
"Perl": [".pl", ".pm", ".pod", ".perl"],
"PowerShell": ['.ps1', '.psd1', '.psm1'],
"Python": [".py"],
"Ruby": [".rb"],
"Rust": [".rs"],
"SQL": [".sql"],
"Scala": [".scala"],
"Shell": [".sh", ".bash", ".command", ".zsh"],
"TypeScript": [".ts", ".tsx"],
"TeX": [".tex"],
"Visual Basic": [".vb"]
}
```
## Licenses
Each example is also annotated with the license of the associated repository. There are in total 15 licenses:
```python
[
'mit',
'apache-2.0',
'gpl-3.0',
'gpl-2.0',
'bsd-3-clause',
'agpl-3.0',
'lgpl-3.0',
'lgpl-2.1',
'bsd-2-clause',
'cc0-1.0',
'epl-1.0',
'mpl-2.0',
'unlicense',
'isc',
'artistic-2.0'
]
```
## Dataset Statistics
The dataset contains 115M files and the sum of all the source code file sizes is 873 GB (note that the size of the dataset is larger due to the extra fields). A breakdown per language is given in the plot and table below:

| | Language |File Count| Size (GB)|
|---:|:-------------|---------:|-------:|
| 0 | Java | 19548190 | 107.70 |
| 1 | C | 14143113 | 183.83 |
| 2 | JavaScript | 11839883 | 87.82 |
| 3 | HTML | 11178557 | 118.12 |
| 4 | PHP | 11177610 | 61.41 |
| 5 | Markdown | 8464626 | 23.09 |
| 6 | C++ | 7380520 | 87.73 |
| 7 | Python | 7226626 | 52.03 |
| 8 | C# | 6811652 | 36.83 |
| 9 | Ruby | 4473331 | 10.95 |
| 10 | GO | 2265436 | 19.28 |
| 11 | TypeScript | 1940406 | 24.59 |
| 12 | CSS | 1734406 | 22.67 |
| 13 | Shell | 1385648 | 3.01 |
| 14 | Scala | 835755 | 3.87 |
| 15 | Makefile | 679430 | 2.92 |
| 16 | SQL | 656671 | 5.67 |
| 17 | Lua | 578554 | 2.81 |
| 18 | Perl | 497949 | 4.70 |
| 19 | Dockerfile | 366505 | 0.71 |
| 20 | Haskell | 340623 | 1.85 |
| 21 | Rust | 322431 | 2.68 |
| 22 | TeX | 251015 | 2.15 |
| 23 | Batchfile | 236945 | 0.70 |
| 24 | CMake | 175282 | 0.54 |
| 25 | Visual Basic | 155652 | 1.91 |
| 26 | FORTRAN | 142038 | 1.62 |
| 27 | PowerShell | 136846 | 0.69 |
| 28 | Assembly | 82905 | 0.78 |
| 29 | Julia | 58317 | 0.29 |
## Dataset Creation
The dataset was created in two steps:
1. Files of with the extensions given in the list above were retrieved from the GitHub dataset on BigQuery (full query [here](https://huggingface.co/datasets/codeparrot/github-code/blob/main/query.sql)). The query was executed on _Mar 16, 2022, 6:23:39 PM UTC+1_.
2. Files with lines longer than 1000 characters and duplicates (exact duplicates ignoring whitespaces) were dropped (full preprocessing script [here](https://huggingface.co/datasets/codeparrot/github-code/blob/main/github_preprocessing.py)).
## Considerations for Using the Data
The dataset consists of source code from a wide range of repositories. As such they can potentially include harmful or biased code as well as sensitive information like passwords or usernames.
## Releases
You can load any older version of the dataset with the `revision` argument:
```Python
ds = load_dataset("codeparrot/github-code", revision="v1.0")
```
### v1.0
- Initial release of dataset
- The query was executed on _Feb 14, 2022, 12:03:16 PM UTC+1_
### v1.1
- Fix missing Scala/TypeScript
- Fix deduplication issue with inconsistent Python `hash`
- The query was executed on _Mar 16, 2022, 6:23:39 PM UTC+1_
# GitHub代码数据集
## 数据集概览
本GitHub代码数据集包含来自GitHub的1.15亿个代码文件,涵盖32种编程语言、60余种文件扩展名,总数据量达1太字节(1TB)。该数据集源自谷歌BigQuery(Google BigQuery)上的公开GitHub数据集。
### 使用方法
由于本数据集体量庞大,针对绝大多数应用场景,建议借助`datasets`库的流式API进行处理。您可通过以下两行代码加载并遍历该数据集:
python
from datasets import load_dataset
ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
print(next(iter(ds)))
#输出:
{
'code': "import mod189 from './mod189';
var value=mod189+1;
export default value;
",
'repo_name': 'MirekSz/webpack-es6-ts',
'path': 'app/mods/mod190.js',
'language': 'JavaScript',
'license': 'isc',
'size': 73
}
您可看到,除代码内容、仓库名称与文件路径外,数据集还包含编程语言、开源许可证及文件大小信息。您也可基于数据集中包含的30种编程语言(完整列表见下文)筛选出所需子集,只需将语言列表以参数形式传入即可。例如,若您希望构建针对Dockerfile的Codex类模型,可使用如下配置:
python
ds = load_dataset("codeparrot/github-code", streaming=True, split="train", languages=["Dockerfile"])
print(next(iter(ds))["code"])
#输出:
"""
FROM rockyluke/ubuntu:precise
ENV DEBIAN_FRONTEND="noninteractive"
TZ="Europe/Amsterdam"
...
"""
我们还可获取文件所属仓库的开源许可证信息,因此能够以筛选编程语言的相同方式按许可证过滤数据集:
python
ds = load_dataset("codeparrot/github-code", streaming=True, split="train", licenses=["mit", "isc"])
licenses = []
for element in iter(ds).take(10_000):
licenses.append(element["license"])
print(Counter(licenses))
#输出:
Counter({'mit': 9896, 'isc': 104})
当然,您也可下载完整数据集。请注意,该操作将下载约300吉字节的压缩文本数据,解压后数据集将占用约1太字节的存储空间:
python
ds = load_dataset("codeparrot/github-code", split="train")
## 数据结构
### 数据实例
python
{
'code': "import mod189 from './mod189';
var value=mod189+1;
export default value;
",
'repo_name': 'MirekSz/webpack-es6-ts',
'path': 'app/mods/mod190.js',
'language': 'JavaScript',
'license': 'isc',
'size': 73
}
### 数据字段
|字段|类型|描述|
|---|---|---|
|code|string|源代码文件内容|
|repo_name|string|GitHub仓库名称|
|path|string|文件在GitHub仓库中的路径|
|language|string|基于文件扩展名推断的编程语言|
|license|string|GitHub仓库的开源许可证|
|size|int|源代码文件大小(单位:字节)|
### 数据划分
本数据集仅包含训练集划分。
## 编程语言覆盖范围
本数据集涵盖30种编程语言、超60种文件扩展名:
python
{
"汇编语言": [".asm"],
"批处理文件": [".bat", ".cmd"],
"C语言": [".c", ".h"],
"C#": [".cs"],
"C++": [".cpp", ".hpp", ".c++", ".h++", ".cc", ".hh", ".C", ".H"],
"CMake": [".cmake"],
"CSS": [".css"],
"Dockerfile": [".dockerfile", "Dockerfile"],
"FORTRAN": ['.f90', '.f', '.f03', '.f08', '.f77', '.f95', '.for', '.fpp'],
"GO": [".go"],
"Haskell": [".hs"],
"HTML":[".html"],
"Java": [".java"],
"JavaScript": [".js"],
"Julia": [".jl"],
"Lua": [".lua"],
"Makefile": ["Makefile"],
"Markdown": [".md", ".markdown"],
"PHP": [".php", ".php3", ".php4", ".php5", ".phps", ".phpt"],
"Perl": [".pl", ".pm", ".pod", ".perl"],
"PowerShell": ['.ps1', '.psd1', '.psm1'],
"Python": [".py"],
"Ruby": [".rb"],
"Rust": [".rs"],
"SQL": [".sql"],
"Scala": [".scala"],
"Shell": [".sh", ".bash", ".command", ".zsh"],
"TypeScript": [".ts", ".tsx"],
"TeX": [".tex"],
"Visual Basic": [".vb"]
}
## 开源许可证
每个数据实例都标注了所属仓库的开源许可证类型,本数据集共包含15种许可证:
python
[
'mit',
'apache-2.0',
'gpl-3.0',
'gpl-2.0',
'bsd-3-clause',
'agpl-3.0',
'lgpl-3.0',
'lgpl-2.1',
'bsd-2-clause',
'cc0-1.0',
'epl-1.0',
'mpl-2.0',
'unlicense',
'isc',
'artistic-2.0'
]
## 数据集统计信息
本数据集共包含1.15亿个代码文件,所有源代码文件的总大小为873吉字节(注:由于额外字段的存在,数据集实际占用存储空间会大于该数值)。各编程语言的细分统计信息如下表与统计图所示:

|序号|编程语言|文件数量| 大小(GB)|
|---:|:-------------|---------:|-------:|
| 0 | Java | 19548190 | 107.70 |
| 1 | C | 14143113 | 183.83 |
| 2 | JavaScript | 11839883 | 87.82 |
| 3 | HTML | 11178557 | 118.12 |
| 4 | PHP | 11177610 | 61.41 |
| 5 | Markdown | 8464626 | 23.09 |
| 6 | C++ | 7380520 | 87.73 |
| 7 | Python | 7226626 | 52.03 |
| 8 | C# | 6811652 | 36.83 |
| 9 | Ruby | 4473331 | 10.95 |
| 10 | GO | 2265436 | 19.28 |
| 11 | TypeScript | 1940406 | 24.59 |
| 12 | CSS | 1734406 | 22.67 |
| 13 | Shell | 1385648 | 3.01 |
| 14 | Scala | 835755 | 3.87 |
| 15 | Makefile | 679430 | 2.92 |
| 16 | SQL | 656671 | 5.67 |
| 17 | Lua | 578554 | 2.81 |
| 18 | Perl | 497949 | 4.70 |
| 19 | Dockerfile | 366505 | 0.71 |
| 20 | Haskell | 340623 | 1.85 |
| 21 | Rust | 322431 | 2.68 |
| 22 | TeX | 251015 | 2.15 |
| 23 | Batchfile | 236945 | 0.70 |
| 24 | CMake | 175282 | 0.54 |
| 25 | Visual Basic | 155652 | 1.91 |
| 26 | FORTRAN | 142038 | 1.62 |
| 27 | PowerShell | 136846 | 0.69 |
| 28 | Assembly | 82905 | 0.78 |
| 29 | Julia | 58317 | 0.29 |
## 数据集构建流程
本数据集通过两个步骤构建完成:
1. 从BigQuery上的GitHub数据集中检索符合上述扩展名列表的代码文件(完整查询语句见[此处](https://huggingface.co/datasets/codeparrot/github-code/blob/main/query.sql)),查询执行时间为**2022年3月16日,18:23:39 UTC+1**。
2. 移除行数超过1000的文件及重复文件(忽略空白字符后的精确重复),完整预处理脚本见[此处](https://huggingface.co/datasets/codeparrot/github-code/blob/main/github_preprocessing.py)。
## 数据集使用注意事项
本数据集包含来自各类开源仓库的源代码,因此可能包含有害或存在偏见的代码,亦可能泄露密码、用户名等敏感信息。
## 版本发布
您可通过`revision`参数加载数据集的历史版本:
python
ds = load_dataset("codeparrot/github-code", revision="v1.0")
### v1.0
- 数据集的初始发布版本
- 查询执行时间为**2022年2月14日,12:03:16 UTC+1**
### v1.1
- 修复了Scala/TypeScript语言缺失的问题
- 修复了Python哈希值不一致导致的去重问题
- 查询执行时间为**2022年3月16日,18:23:39 UTC+1**
提供机构:
maas
创建时间:
2024-06-06



