gitgud-code
Source: ModelScope community (魔搭社区). Updated 2025-09-29; indexed 2025-10-04.
Download link:
https://modelscope.cn/datasets/nyuuzyou/gitgud-code
Resource description:
# Dataset Card for gitgud.io Code Dataset
## Dataset Summary
This dataset contains source code files collected from 26,610 repositories and branches hosted on gitgud.io, a free code hosting platform. The dataset includes code from various programming languages and represents a diverse collection of open-source projects and personal repositories.
## Dataset Structure
### Data Fields
This dataset includes the following fields:
- `code`: Content of the source file (string)
- `repo_name`: Name of the gitgud.io repository (string)
- `path`: Path of file within the repository (string)
- `language`: Programming language as inferred by file extension (string)
- `license`: License of the repository, if available (string)
- `size`: Size of source file in bytes (integer)
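The field list above can be mirrored as a typed record. The following is a minimal sketch, not part of the dataset itself: the `GitgudRecord` class and the `check_record` helper are hypothetical names, and the sketch assumes that `license` may be null when no license is available, which the card does not state explicitly.

```python
from typing import Any, Optional, TypedDict


class GitgudRecord(TypedDict):
    """Hypothetical typed view of one JSONL row, mirroring the card's field list."""
    code: str
    repo_name: str
    path: str
    language: str
    license: Optional[str]  # assumption: may be null when no license is available
    size: int


STRING_FIELDS = ("code", "repo_name", "path", "language")


def check_record(row: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a row; an empty list means it looks valid."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(row.get(name), str):
            problems.append(f"{name}: expected string")
    if row.get("license") is not None and not isinstance(row["license"], str):
        problems.append("license: expected string or null")
    size = row.get("size")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(size, int) or isinstance(size, bool):
        problems.append("size: expected integer")
    return problems
```

Running `check_record` on a sample of rows before processing is a cheap way to confirm that the files match the documented schema.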
### Data Splits
All examples are in the train split; there is no validation split.
### Data Format
- **Format**: JSONL (JSON Lines) compressed with Zstandard (.jsonl.zst)
- **File Structure**: 194 files (gitgud_0000.jsonl.zst to gitgud_0193.jsonl.zst)
- **Total Repositories**: 26,610 repositories and branches
- **Filtering**: Files with lines longer than 1,000 characters were excluded
- **Deduplication**: No deduplication was performed on the dataset
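The format notes above suggest a simple streaming workflow: enumerate the shards, decompress each one lazily, and optionally deduplicate, since the card says no deduplication was applied. The sketch below is one possible approach under stated assumptions. All function names are hypothetical, `iter_shard` relies on the third-party `zstandard` package and assumes the shards are UTF-8 encoded, `has_long_line` only approximates the card's filtering rule (the exact line-splitting logic used by the dataset author is not documented), and `dedup_exact` is exact-match deduplication by SHA-256 of the `code` field, a choice made here for illustration.

```python
import hashlib
import io
import json
from typing import Iterable, Iterator


def shard_names(count: int = 194) -> list[str]:
    """File names gitgud_0000.jsonl.zst .. gitgud_0193.jsonl.zst, per the card."""
    return [f"gitgud_{i:04d}.jsonl.zst" for i in range(count)]


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_shard(path: str) -> Iterator[dict]:
    """Stream records from one shard without loading it fully into memory."""
    import zstandard  # third-party dependency: pip install zstandard

    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        with io.TextIOWrapper(reader, encoding="utf-8") as text:
            yield from parse_jsonl(text)


def has_long_line(code: str, limit: int = 1000) -> bool:
    """True if any line exceeds `limit` characters (approximates the card's rule)."""
    return any(len(line) > limit for line in code.splitlines())


def dedup_exact(records: Iterable[dict]) -> Iterator[dict]:
    """Drop records whose `code` is identical to one already seen."""
    seen = set()
    for rec in records:
        digest = hashlib.sha256(rec["code"].encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield rec
```

A typical use would chain these, for example `dedup_exact(iter_shard(name))` for each `name` in `shard_names()`, with the shard files downloaded to the working directory. Keeping only hashes in memory avoids holding every file's contents while still catching exact duplicates across repositories and branches.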
Provider:
maas
Created:
2025-07-19



