
nyuuzyou/jihulab-code

Source: Hugging Face · Updated 2026-01-09 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/nyuuzyou/jihulab-code
Description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- code
- zh
- en
license: other
multilinguality:
- multilingual
pretty_name: JihuLab Code Dataset
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
tags:
- code
- chinese
configs:
- config_name: default
  data_files:
  - split: train
    path: "data.parquet"
  default: true
dataset_info:
  features:
  - name: code
    dtype: string
  - name: repo_name
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: size
    dtype: int64
---

# JihuLab Code Dataset

## Dataset Description

This dataset was compiled from code repositories hosted on [JihuLab](https://jihulab.com), a GitLab-based code hosting platform operated by JiHu (GitLab's Chinese joint venture). JihuLab serves as the primary GitLab instance for Chinese developers and enterprises, offering localized services and compliance with Chinese regulations. This dataset is particularly valuable for training code models with Chinese language understanding and enterprise-level coding practices.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 1,853,253 |
| **Total Repositories** | 11,589 |
| **Total Size** | 1.5 GB (compressed Parquet) / 12.76 GB (uncompressed) |
| **Programming Languages** | 304 |
| **File Format** | Parquet with Zstd compression |

### Key Features

- **Chinese developer ecosystem**: Contains code from JihuLab, GitLab's official Chinese distribution, featuring Chinese comments, documentation, and variable names
- **Diverse language coverage**: Spans 304 programming languages identified by [go-enry](https://github.com/go-enry/go-enry) (based on GitHub Linguist rules)
- **Rich metadata**: Includes repository name, file path, detected language, license information, and file size
- **Enterprise and open-source projects**: Includes code from both individual developers and Chinese enterprises using GitLab
- **Quality filtered**: Extensive filtering to remove vendor code, build artifacts, generated files, and low-quality content

### Languages

The dataset includes 304 programming languages. The top 30 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 348,517 |
| 2 | C | 209,924 |
| 3 | JavaScript | 191,164 |
| 4 | Python | 172,798 |
| 5 | C++ | 136,046 |
| 6 | Go | 80,000 |
| 7 | TypeScript | 79,067 |
| 8 | HTML | 69,173 |
| 9 | C# | 64,511 |
| 10 | Rust | 50,515 |
| 11 | Shell | 43,352 |
| 12 | Vue | 40,687 |
| 13 | TSX | 36,844 |
| 14 | CSS | 34,779 |
| 15 | Makefile | 26,227 |
| 16 | Ruby | 25,812 |
| 17 | PHP | 21,401 |
| 18 | CMake | 15,292 |
| 19 | Kotlin | 14,220 |
| 20 | BitBake | 13,060 |
| 21 | SCSS | 10,957 |
| 22 | Scala | 9,333 |
| 23 | Dart | 9,125 |
| 24 | Lua | 7,413 |
| 25 | ASP.NET | 7,005 |
| 26 | Vim Script | 5,710 |
| 27 | Unix Assembly | 5,239 |
| 28 | Starlark | 5,134 |
| 29 | Objective-C | 4,931 |
| 30 | Factor | 4,920 |

### Licenses

The dataset includes files from repositories with various licenses.
Repositories with restrictive licenses (CC-BY-ND variants, Commons Clause, SSPL) were excluded:

| License | File Count |
|---------|------------|
| apache-2.0 | 551,008 |
| unknown | 535,320 |
| mit | 320,834 |
| agpl-3.0 | 169,922 |
| gpl-2.0 | 112,829 |
| bsd | 65,104 |
| cc0-1.0 | 13,557 |
| lgpl-3.0 | 12,871 |
| lgpl-2.1 | 9,960 |
| bsd-3-clause | 9,109 |
| bsl-1.1 | 8,972 |
| epl-1.0 | 7,494 |
| gpl-3.0 | 7,476 |
| unlicense | 6,265 |
| cc-by-3.0 | 4,717 |
| cc-by-nc-sa | 4,339 |
| mpl-2.0 | 3,847 |
| cc-by-4.0 | 2,459 |
| cc-by-nc-sa-4.0 | 1,715 |
| cc-by-sa-4.0 | 1,701 |
| bsd-2-clause | 1,599 |
| cc-by-nc-nd-4.0 | 1,222 |
| isc | 520 |
| wtfpl | 274 |
| cc-by-nc-4.0 | 122 |
| cc-by-sa | 13 |
| cc-by-sa-3.0 | 4 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | Content of the source file (UTF-8 encoded) |
| `repo_name` | string | Name of the JihuLab repository (format: `username/repo` or `group/subgroup/repo`) |
| `path` | string | Path of the file within the repository (relative to repo root) |
| `language` | string | Programming language as identified by [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | License of the repository (SPDX identifier or "unknown") |
| `size` | int64 | Size of the source file in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: Single consolidated file (`data.parquet`)

### Data Splits

All examples are in the train split. There is no validation or test split.

### Example Data Point

```
{
    'code': 'package com.example.demo;\n\nimport org.springframework.boot.*;\nimport org.springframework.boot.autoconfigure.*;\n...',
    'repo_name': 'SmallQ/demo',
    'path': 'src/main/java/com/example/demo/DemoApplication.java',
    'language': 'Java',
    'license': 'unknown',
    'size': 400
}
```

## Dataset Creation

### Pipeline Overview

The dataset was created through a multi-stage pipeline:

1. **Repository Discovery**: Paginated API requests to JihuLab's GitLab API (`/api/v4/projects`) to enumerate public repositories
2. **Branch Selection**: Using the repository's default branch (typically `main` or `master`)
3. **Repository Downloading**: Downloading repository archives via JihuLab's archive endpoint
4. **Content Extraction**: Extracting and filtering source code files
5. **Parquet Generation**: Writing filtered records to Parquet with Zstd compression

### Language Detection

Programming languages are detected using [go-enry](https://github.com/go-enry/go-enry), a Go port of GitHub's Linguist library. Only files classified as **Programming** or **Markup** language types are included (Data and Prose types are excluded).

### License Detection

Licenses are detected by:

1. Scanning for license files (`LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, etc.)
2. Matching license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.)
3. Defaulting to "unknown" if no license can be detected

**Blocked Licenses**: The following restrictive licenses are excluded from the dataset:

- `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0` (Creative Commons No-Derivatives)
- `commons-clause`
- `sspl`, `sspl-1.0` (Server Side Public License)

### File Filtering

Extensive filtering is applied to ensure data quality:

#### Size Limits

| Limit | Value |
|-------|-------|
| Max repository ZIP size | 48 MB |
| Max single file size | 1 MB |
| Max line length | 1,000 characters |

#### Excluded Directories

- **Configuration**: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- **Vendor/Dependencies**: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- **Build Output**: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`,
  `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`

#### Excluded Files

- **Lock Files**: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
- **Minified Files**: Any file containing `.min.` in the name
- **Binary Files**: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
- **System Files**: `.DS_Store`, `thumbs.db`

#### Content Filtering

- **UTF-8 Validation**: Files must be valid UTF-8 encoded text
- **Binary Detection**: Files detected as binary by go-enry are excluded
- **Generated Files**: Files with generation markers in the first 500 bytes are excluded:
  - `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `code generator`, `generated code`, `this file is generated`, `@generated`, `<auto-generated`
- **Empty Files**: Files that are empty or contain only whitespace are excluded
- **Long Lines**: Files with any line exceeding 1,000 characters are excluded
- **go-enry Filters**: Additional filtering using go-enry's `IsVendor()`, `IsImage()`, `IsDotFile()`, `IsTest()`, and `IsGenerated()` functions
- **Documentation-only Repos**: Repositories containing only documentation files (no actual code) are skipped

### Source Data

All data originates from public repositories hosted on [JihuLab](https://jihulab.com).
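
The path and content rules listed under File Filtering can be sketched in Python roughly as follows. This is an illustration only, not the pipeline's actual code: the function names `is_excluded_path` and `passes_content_filters` are hypothetical, the sets cover only a subset of the listed directories and extensions, and the go-enry checks are left out. Reading "1 MB" as 1,048,576 bytes and matching generation markers case-insensitively are both assumptions.

```python
from pathlib import PurePosixPath

# Subsets of the documented lists, kept short for illustration.
EXCLUDED_DIRS = {".git", "node_modules", "vendor", "third_party",
                 "build", "dist", "target", "__pycache__"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock", "go.sum"}
BINARY_EXTS = {".exe", ".dll", ".so", ".jar", ".class", ".pyc", ".png", ".zip", ".pdf"}
SYSTEM_FILES = {".ds_store", "thumbs.db"}
GENERATED_MARKERS = (
    "generated by", "do not edit", "auto-generated", "autogenerated",
    "automatically generated", "code generator", "generated code",
    "this file is generated", "@generated", "<auto-generated",
)
MAX_FILE_BYTES = 1024 * 1024  # assumption: "1 MB" means 1 MiB
MAX_LINE_CHARS = 1000


def is_excluded_path(path: str) -> bool:
    """Return True if a repo-relative path matches an exclusion rule."""
    p = PurePosixPath(path)
    # Check every directory component (everything except the file name).
    for part in p.parts[:-1]:
        if part in EXCLUDED_DIRS or part.startswith("cmake-build-"):
            return True
    name = p.name
    if name in LOCK_FILES or ".min." in name:
        return True
    if name.lower() in SYSTEM_FILES:
        return True
    return p.suffix.lower() in BINARY_EXTS


def passes_content_filters(raw: bytes) -> bool:
    """Return True if file content survives the size and content checks."""
    if len(raw) > MAX_FILE_BYTES:
        return False
    try:
        text = raw.decode("utf-8")  # UTF-8 validation
    except UnicodeDecodeError:
        return False
    if not text.strip():  # empty or whitespace-only
        return False
    # Look for generation markers in the first 500 bytes only.
    head = raw[:500].decode("utf-8", errors="ignore").lower()
    if any(marker in head for marker in GENERATED_MARKERS):
        return False
    # Reject the file if any single line is too long.
    return all(len(line) <= MAX_LINE_CHARS for line in text.splitlines())
```

A file would be kept only when `is_excluded_path(path)` is false and `passes_content_filters(raw)` is true; checking the path first avoids reading content that will be dropped anyway.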

## Considerations for Using the Data

### Personal and Sensitive Information

The dataset may contain:

- Email addresses in code comments or configuration files
- API keys or credentials that were accidentally committed
- Personal information in comments or documentation

Users should exercise caution and implement appropriate filtering when using this data.

### Licensing Information

This dataset is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in this dataset must abide by the terms of the original licenses, including attribution clauses when relevant. The license field in each data point indicates the license of the source repository.
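
As one way to act on the two considerations above, the sketch below keeps only records whose `license` value is on an allowlist and masks email addresses in the `code` field. It assumes records are plain dictionaries with the documented fields. The allowlist `PERMISSIVE`, the functions `filter_by_license` and `redact_emails`, the `<EMAIL>` placeholder, and the deliberately simple email pattern are all hypothetical choices for illustration; the pattern will miss some addresses and does nothing about API keys or other credentials.

```python
import re
from typing import Dict, Iterable, Iterator

# Hypothetical allowlist; choose values to match your own policy.
PERMISSIVE = frozenset({
    "apache-2.0", "mit", "bsd", "bsd-2-clause", "bsd-3-clause",
    "isc", "unlicense", "cc0-1.0",
})

# Simple illustrative pattern, not a complete email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def filter_by_license(records: Iterable[Dict], allowed=PERMISSIVE) -> Iterator[Dict]:
    """Yield only records whose license is in the allowlist."""
    for record in records:
        if record.get("license", "unknown") in allowed:
            yield record


def redact_emails(code: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", code)


# Made-up sample records using the documented field names.
sample = [
    {"code": "# contact: dev@example.com\nprint('hi')\n", "license": "mit"},
    {"code": "int main(void) { return 0; }\n", "license": "unknown"},
]
kept = [dict(r, code=redact_emails(r["code"])) for r in filter_by_license(sample)]
```

Records labeled `unknown` are dropped here because their terms cannot be checked from the `license` field alone; whether that is the right default depends on the intended use.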
Provider: nyuuzyou