Ujjwal-Tyagi/jihulab
Source: Hugging Face · Updated 2026-03-30 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Ujjwal-Tyagi/jihulab
Description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- code
- zh
- en
license: other
multilinguality:
- multilingual
pretty_name: JihuLab Code Dataset
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
tags:
- code
- chinese
configs:
- config_name: default
  data_files:
  - split: train
    path: "Data/*.parquet"
  default: true
dataset_info:
  features:
  - name: code
    dtype: string
  - name: repo_name
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: size
    dtype: int64
---
# JihuLab Code Dataset
This is a code dataset compiled from [JihuLab](https://jihulab.com), GitLab's official Chinese distribution and the primary GitLab instance for Chinese developers and enterprises. It is designed to support training code models with strong Chinese language understanding and familiarity with enterprise-level coding practices.
---
## Overview
The JihuLab Code Dataset is a code corpus collected from China's official GitLab platform. It contains 1,853,253 files from 11,589 repositories, covering both open-source and enterprise projects across 304 programming languages. It is intended as a resource for developing multilingual code understanding models tailored to Chinese developers and organizations, in compliance with Chinese regulatory requirements.
### Key Statistics
| Metric | Value |
|--------|-------|
| Total Files | 1,853,253 |
| Total Repositories | 11,589 |
| Compressed Size | 1.5 GB (Parquet with Zstd) |
| Uncompressed Size | 12.76 GB |
| Programming Languages | 304 |
| File Format | Single Parquet file |
---
## Dataset Characteristics
### Scope and Coverage
This dataset captures code from over 11,500 repositories hosted on JihuLab, including:
- **Chinese enterprise ecosystem**: Extensive coverage of code from Chinese developers and enterprises, featuring Chinese comments, documentation, and variable naming conventions
- **Diverse language ecosystem**: Support for 304 distinct programming languages
- **Developer and enterprise projects**: A comprehensive mix of individual developer projects and enterprise-grade codebases
- **GitLab-native practices**: Code following GitLab workflows and best practices
- **Quality-assured**: Rigorously filtered to exclude vendor code, build artifacts, generated files, and low-quality content
### Programming Languages
The dataset encompasses 304 languages. The 30 most represented languages by file count are:
| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 348,517 |
| 2 | C | 209,924 |
| 3 | JavaScript | 191,164 |
| 4 | Python | 172,798 |
| 5 | C++ | 136,046 |
| 6 | Go | 80,000 |
| 7 | TypeScript | 79,067 |
| 8 | HTML | 69,173 |
| 9 | C# | 64,511 |
| 10 | Rust | 50,515 |
| 11 | Shell | 43,352 |
| 12 | Vue | 40,687 |
| 13 | TSX | 36,844 |
| 14 | CSS | 34,779 |
| 15 | Makefile | 26,227 |
| 16 | Ruby | 25,812 |
| 17 | PHP | 21,401 |
| 18 | CMake | 15,292 |
| 19 | Kotlin | 14,220 |
| 20 | BitBake | 13,060 |
| 21 | SCSS | 10,957 |
| 22 | Scala | 9,333 |
| 23 | Dart | 9,125 |
| 24 | Lua | 7,413 |
| 25 | ASP.NET | 7,005 |
| 26 | Vim Script | 5,710 |
| 27 | Unix Assembly | 5,239 |
| 28 | Starlark | 5,134 |
| 29 | Objective-C | 4,931 |
| 30 | Factor | 4,920 |
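
To put these counts in proportion, they can be converted into shares of the 1,853,253 total files. The following is a minimal sketch using the five largest counts copied from the table; the function name `language_shares` is illustrative and not part of the dataset tooling.

```python
# Counts copied from the table above; the total comes from Key Statistics.
TOTAL_FILES = 1_853_253
TOP_COUNTS = {
    "Java": 348_517,
    "C": 209_924,
    "JavaScript": 191_164,
    "Python": 172_798,
    "C++": 136_046,
}


def language_shares(counts: dict[str, int], total: int) -> dict[str, float]:
    """Return each language's share of `total` as a percentage."""
    return {lang: 100.0 * n / total for lang, n in counts.items()}


shares = language_shares(TOP_COUNTS, TOTAL_FILES)
for lang, pct in shares.items():
    print(f"{lang:<12}{pct:5.1f}%")
print(f"{'Top 5':<12}{sum(shares.values()):5.1f}%")
```

By this arithmetic, Java alone accounts for roughly 19% of files, and the five largest languages together for roughly 57%.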
### License Distribution
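Files span a range of licenses, not all of them open-source: 535,320 files (about 29%) come from repositories with no detected license, and some listed licenses carry non-commercial terms (for example CC-BY-NC 4.0). Repositories under CC-BY-ND, Commons Clause, or SSPL terms have been excluded to ensure broader usability. Note that CC-BY-NC-ND 4.0 still appears in the table; it is not among the excluded identifiers listed under License Detection.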
| License | File Count |
|---------|------------|
| Apache 2.0 | 551,008 |
| Unknown | 535,320 |
| MIT | 320,834 |
| AGPL 3.0 | 169,922 |
| GPL 2.0 | 112,829 |
| BSD | 65,104 |
| CC0 1.0 | 13,557 |
| LGPL 3.0 | 12,871 |
| LGPL 2.1 | 9,960 |
| BSD-3-Clause | 9,109 |
| BSL 1.1 | 8,972 |
| EPL 1.0 | 7,494 |
| GPL 3.0 | 7,476 |
| Unlicense | 6,265 |
| CC-BY 3.0 | 4,717 |
| CC-BY-NC-SA | 4,339 |
| MPL 2.0 | 3,847 |
| CC-BY 4.0 | 2,459 |
| CC-BY-NC-SA 4.0 | 1,715 |
| CC-BY-SA 4.0 | 1,701 |
| BSD-2-Clause | 1,599 |
| CC-BY-NC-ND 4.0 | 1,222 |
| ISC | 520 |
| WTFPL | 274 |
| CC-BY-NC 4.0 | 122 |
| CC-BY-SA | 13 |
| CC-BY-SA 3.0 | 4 |
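
Given this spread of terms, a training pipeline will often keep only an allow-list of licenses. The following is a minimal sketch, assuming records are dictionaries carrying the `license` field described in this card. The allow-list strings are placeholders: the exact spelling stored in `license` should be checked against the data first, for example with the histogram helper. Both helper names are illustrative.

```python
from collections import Counter
from typing import Iterable, Iterator


def license_histogram(records: Iterable[dict]) -> Counter:
    """Count records per value of the `license` field."""
    return Counter(r["license"] for r in records)


def keep_allowed(records: Iterable[dict], allowed: set[str]) -> Iterator[dict]:
    """Yield only records whose license is in `allowed` (case-insensitive)."""
    allowed_lower = {a.lower() for a in allowed}
    for r in records:
        if r["license"].lower() in allowed_lower:
            yield r


# Hypothetical records for illustration; real values may be spelled differently.
sample = [
    {"repo_name": "a/x", "license": "mit"},
    {"repo_name": "b/y", "license": "unknown"},
    {"repo_name": "c/z", "license": "apache-2.0"},
    {"repo_name": "d/w", "license": "agpl-3.0"},
]

ALLOWED = {"mit", "apache-2.0"}  # assumption: adjust to the strings actually present
kept = list(keep_allowed(sample, ALLOWED))
print(license_histogram(sample))
print([r["repo_name"] for r in kept])
```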
---
## Dataset Structure
### Data Fields
Each record contains six fields providing comprehensive metadata and content information:
| Field | Type | Description |
|-------|------|-------------|
| `code` | string | The complete source code content in UTF-8 encoding |
| `repo_name` | string | Repository identifier in the format `username/repo` or `group/subgroup/repo` |
| `path` | string | File path relative to the repository root |
| `language` | string | Programming language identified using [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | Repository license (SPDX identifier or "unknown") |
| `size` | int64 | File size in bytes |
### Sample Record
```json
{
  "code": "package com.example.demo;\n\nimport org.springframework.boot.*;\nimport org.springframework.boot.autoconfigure.*;\n...",
  "repo_name": "SmallQ/demo",
  "path": "src/main/java/com/example/demo/DemoApplication.java",
  "language": "Java",
  "license": "unknown",
  "size": 400
}
```
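
A record of this shape can be checked against the six documented fields before further processing. The helper below is an illustrative sketch; `validate_record` is a hypothetical name and not part of the dataset or of any library.

```python
# Field names and types as documented in the Data Fields table.
EXPECTED_FIELDS = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    return problems


record = {
    "code": "package com.example.demo;\n",
    "repo_name": "SmallQ/demo",
    "path": "src/main/java/com/example/demo/DemoApplication.java",
    "language": "Java",
    "license": "unknown",
    "size": 400,
}
print(validate_record(record))                       # well-formed record
print(validate_record({"code": "x", "size": "4"}))   # missing and mistyped fields
```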
### File Format
- **Format**: Apache Parquet with Zstd compression
- **Structure**: Single consolidated file (`data.parquet`)
- **Encoding**: UTF-8
- **Split**: All examples are included in a single training split (no validation or test splits)
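
Because the data is stored as Parquet under a single `train` split, it can be read with the Hugging Face `datasets` library. The sketch below assumes that the `datasets` package is installed and that the repository ID `Ujjwal-Tyagi/jihulab` from the download link resolves on the Hub. Streaming avoids downloading the whole archive up front. The helper names `stream_train_split` and `take_language` are illustrative.

```python
from itertools import islice
from typing import Iterable, Iterator


def stream_train_split() -> Iterable[dict]:
    """Open the train split lazily (requires network access and `datasets`)."""
    # Imported here so the filtering helper below works without the package.
    from datasets import load_dataset

    return load_dataset("Ujjwal-Tyagi/jihulab", split="train", streaming=True)


def take_language(records: Iterable[dict], language: str, limit: int) -> Iterator[dict]:
    """Return an iterator over at most `limit` records in the given language."""
    matches = (r for r in records if r["language"] == language)
    return islice(matches, limit)


# Offline demonstration with hypothetical records of the documented shape.
demo = [
    {"path": "a.py", "language": "Python"},
    {"path": "b.go", "language": "Go"},
    {"path": "c.py", "language": "Python"},
]
print([r["path"] for r in take_language(demo, "Python", limit=1)])
```

With network access, `take_language(stream_train_split(), "Python", limit=5)` would apply the same filter to the real data.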
---
## Data Creation Process
### Pipeline Stages
The dataset was constructed through a systematic multi-stage pipeline:
1. **Repository Discovery** – Enumeration of public repositories using JihuLab's GitLab API (`/api/v4/projects`) with paginated requests
2. **Branch Selection** – Identification of each repository's default branch, typically `main` or `master`
3. **Repository Cloning** – Download of repository archives via JihuLab's archive endpoint
4. **Content Extraction and Filtering** – Extraction of source code files from each archive, followed by the quality filtering described below
5. **Parquet Serialization** – Writing processed records to compressed Parquet format
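
The first three stages can be sketched with the standard GitLab REST API. This is a sketch under stated assumptions, not the pipeline's actual code: the query parameters (`visibility=public`, `per_page=100`), the stop-on-empty-page loop, and the use of `/projects/:id/repository/archive.tar.gz` as the archive endpoint are assumptions, and all function names are illustrative.

```python
import json
from typing import Callable, Iterator
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen

BASE = "https://jihulab.com/api/v4"


def projects_url(page: int, per_page: int = 100) -> str:
    """Build the URL for one page of the public project listing."""
    query = urlencode({"visibility": "public", "per_page": per_page, "page": page})
    return f"{BASE}/projects?{query}"


def archive_url(project_id: int, branch: str) -> str:
    """Build the archive download URL for one branch of a project."""
    query = urlencode({"sha": branch})
    return f"{BASE}/projects/{project_id}/repository/archive.tar.gz?{query}"


def fetch_json(url: str) -> list[dict]:
    """Fetch a URL and decode its JSON body (performs a network request)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_projects(fetch: Callable[[str], list[dict]] = fetch_json) -> Iterator[dict]:
    """Walk the paginated listing until an empty page is returned."""
    page = 1
    while True:
        batch = fetch(projects_url(page))
        if not batch:
            return
        yield from batch
        page += 1


# Offline demonstration: a fake fetcher stands in for the real API.
def fake_fetch(url: str) -> list[dict]:
    page = int(parse_qs(urlsplit(url).query)["page"][0])
    pages = {1: [{"id": 7, "default_branch": "main"}, {"id": 8, "default_branch": None}]}
    return pages.get(page, [])


archives = [
    archive_url(p["id"], p["default_branch"])
    for p in iter_projects(fake_fetch)
    if p.get("default_branch")  # empty repositories have no default branch
]
print(archives)
```

Passing the fetcher as a parameter keeps the pagination logic testable without network access; calling `iter_projects()` with no argument would query the live API.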
### Language Detection
Programming languages are identified using [go-enry](https://github.com/go-enry/go-enry), a Go implementation of GitHub's Linguist classification system. Only files classified as **Programming** or **Markup** types are retained; Data and Prose file types are excluded.
### License Detection
License identification follows a three-step process:
1. Scan for license files: `LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, and similar variants
2. Match license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.)
3. Default to "unknown" if no license match is found
**Excluded Licenses**: The following restrictive licenses are filtered out to ensure broad usability:
- Creative Commons No-Derivatives: `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0`
- `commons-clause`
- Server Side Public License: `sspl`, `sspl-1.0`
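
The three-step detection and the exclusion list can be sketched as follows. The regular expressions are illustrative stand-ins, since this card does not publish the actual matching rules. Restricting the scan to files at the repository root and comparing file names case-insensitively are also assumptions. The names `detect_license` and `is_excluded` are hypothetical.

```python
import re

# Step 1: file names treated as license files (compared in lowercase).
LICENSE_FILENAMES = {"license", "license.txt", "license.md", "copying"}

# Step 2: illustrative patterns only; the real rules are not documented here.
PATTERNS = [
    ("apache-2.0", re.compile(r"Apache License\s+Version 2\.0", re.I)),
    ("mit", re.compile(r"Permission is hereby granted, free of charge", re.I)),
    ("sspl-1.0", re.compile(r"Server Side Public License", re.I)),
]

# Identifiers copied from the Excluded Licenses list above.
EXCLUDED = {
    "cc-by-nd", "cc-by-nd-2.0", "cc-by-nd-3.0", "cc-by-nd-4.0",
    "commons-clause", "sspl", "sspl-1.0",
}


def detect_license(files: dict[str, str]) -> str:
    """Return a license ID for a repository given {path: text}, else "unknown"."""
    for path, text in files.items():
        if path.lower() not in LICENSE_FILENAMES:
            continue  # assumption: only root-level license files are scanned
        for license_id, pattern in PATTERNS:
            if pattern.search(text):
                return license_id
    return "unknown"  # step 3: default when nothing matches


def is_excluded(license_id: str) -> bool:
    """Report whether a detected license is on the exclusion list."""
    return license_id.lower() in EXCLUDED


repo = {"LICENSE": "MIT License\n\nPermission is hereby granted, free of charge, to any person"}
print(detect_license(repo))
print(detect_license({"README.md": "hello"}))
print(is_excluded("SSPL-1.0"))
```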
### Quality Filtering
Extensive filtering mechanisms ensure dataset quality and usability:
#### Size Constraints
| Constraint | Limit |
|-----------|-------|
| Maximum repository compressed size | 48 MB |
| Maximum single file size | 1 MB |
| Maximum line length | 1,000 characters |
#### Excluded Directories
**Version Control and IDE Configuration**
- `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
**Dependencies and Vendor Code**
- `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
**Build Artifacts and Output**
- `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`
#### Excluded Files
**Dependency Lock Files**
- `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`
**Minified Code**
- Any file containing `.min.` in the filename
**Binary and Non-Code Files**
- Executables, libraries, and object files: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`
- Java archives and compiled bytecode: `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`
- Documents: `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`
- Archives: `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`
- Media: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`
- Fonts: `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`
**System Files**
- `.DS_Store`, `thumbs.db`
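
The directory and file exclusions above amount to a path-based predicate. The following is a minimal sketch that copies the listed names; `lib/vendor/` and `target/dependency/` are covered by the `vendor` and `target` entries. It assumes directory and file names are matched case-sensitively and extensions case-insensitively, which this card does not specify. `is_excluded_path` is an illustrative name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

EXCLUDED_DIRS = {
    ".git", ".github", ".gitlab", ".vscode", ".idea", ".vs", ".settings",
    ".eclipse", ".project", ".metadata",
    "node_modules", "bower_components", "jspm_packages", "vendor", "third_party",
    "3rdparty", "external", "packages", "deps", "Pods",
    "build", "dist", "out", "bin", "target", "release", "debug", ".next", ".nuxt",
    "_site", "_build", "__pycache__", ".pytest_cache", ".gradle", ".maven",
}
EXCLUDED_DIR_GLOBS = ("cmake-build-*",)
EXCLUDED_FILES = {
    "package-lock.json", "yarn.lock", "pnpm-lock.yaml", "Gemfile.lock",
    "Cargo.lock", "poetry.lock", "Pipfile.lock", "composer.lock", "go.sum",
    "mix.lock", ".DS_Store", "thumbs.db",
}
EXCLUDED_EXTENSIONS = {
    ".exe", ".dll", ".so", ".dylib", ".a", ".lib", ".o", ".obj",
    ".jar", ".war", ".ear", ".class", ".pyc", ".pyo", ".wasm",
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".zip", ".tar", ".gz", ".bz2", ".7z", ".rar",
    ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".ico", ".svg",
    ".mp3", ".mp4", ".avi", ".mov", ".wav", ".flac",
    ".ttf", ".otf", ".woff", ".woff2", ".eot",
}


def is_excluded_path(path: str) -> bool:
    """Return True if a repository-relative path hits any exclusion rule."""
    p = PurePosixPath(path)
    for part in p.parts[:-1]:  # every directory component of the path
        if part in EXCLUDED_DIRS:
            return True
        if any(fnmatchcase(part, glob) for glob in EXCLUDED_DIR_GLOBS):
            return True
    if p.name in EXCLUDED_FILES or ".min." in p.name:
        return True
    return p.suffix.lower() in EXCLUDED_EXTENSIONS


for candidate in ["src/main/App.java", "node_modules/x/index.js", "static/app.min.js"]:
    print(candidate, is_excluded_path(candidate))
```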
#### Content Validation
Files must meet the following criteria to be included:
- **Text Encoding**: Valid UTF-8 encoding required
- **Binary Detection**: Files identified as binary by go-enry are excluded
- **Auto-generation Markers**: Files with generation indicators in the first 500 bytes are filtered out:
  - Markers: `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `code generator`, `generated code`, `this file is generated`, `@generated`, `<auto-generated`
- **Content Quality**: Empty files or those containing only whitespace are excluded
- **Line Length**: Files with any line exceeding 1,000 characters are excluded
- **Advanced Filtering**: Additional go-enry checks exclude vendor code, images, dotfiles, test files, and detected generated code
- **Repository Type**: Repositories containing only documentation are skipped
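
The content-level checks that do not depend on go-enry can be sketched as a single predicate over the raw file bytes. The sketch assumes that "1 MB" means 1,000,000 bytes and that marker matching is case-insensitive; neither detail is stated in this card. Binary detection and the additional go-enry checks are omitted. `passes_content_checks` is an illustrative name.

```python
MAX_FILE_BYTES = 1_000_000  # assumption: "1 MB" read as 10**6 bytes
MAX_LINE_CHARS = 1_000
MARKER_WINDOW = 500         # only the first 500 bytes are searched for markers
GENERATED_MARKERS = (
    "generated by", "do not edit", "auto-generated", "autogenerated",
    "automatically generated", "code generator", "generated code",
    "this file is generated", "@generated", "<auto-generated",
)


def passes_content_checks(raw: bytes) -> bool:
    """Apply the size, encoding, emptiness, line-length, and marker checks."""
    if len(raw) > MAX_FILE_BYTES:
        return False
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8
    if not text.strip():
        return False  # empty or whitespace-only
    if any(len(line) > MAX_LINE_CHARS for line in text.splitlines()):
        return False
    # A multi-byte character may be cut at the window edge, so ignore errors here.
    head = raw[:MARKER_WINDOW].decode("utf-8", errors="ignore").lower()
    return not any(marker in head for marker in GENERATED_MARKERS)


print(passes_content_checks(b"print('hi')\n"))
print(passes_content_checks(b"// Code generated by protoc. DO NOT EDIT.\npackage x\n"))
```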
---
## Usage Considerations
### Data Privacy and Security
The dataset may contain sensitive information that requires careful handling:
- **Email Addresses**: Present in code comments, documentation, or configuration files
- **Credentials**: Accidentally committed API keys or authentication tokens
- **Personal Information**: Names, phone numbers, and other identifiable data in comments or documentation
Users should implement appropriate filtering and anonymization when preparing data for model training.
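
As a starting point for such filtering, the sketch below masks email addresses and quoted values assigned to key-like names. Both patterns are illustrative assumptions and will miss many real cases; they are not a substitute for a dedicated secret scanner or PII detector. `redact` is a hypothetical helper name.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Illustrative secret shape: a key-like name assigned a quoted token of 16+ characters.
SECRET_RE = re.compile(
    r"""((?:api[_-]?key|secret|token|password)\w*\s*[:=]\s*)(["'])[^"'\s]{16,}\2""",
    re.IGNORECASE,
)


def redact(code: str) -> str:
    """Replace email addresses and key-like quoted tokens with placeholders."""
    code = EMAIL_RE.sub("<EMAIL>", code)
    # Keep the assignment prefix and the quotes; replace only the token itself.
    return SECRET_RE.sub(r"\1\2<REDACTED>\2", code)


line = 'API_KEY = "abcd1234efgh5678ijkl"  # contact dev@example.com'
print(redact(line))
```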
### Licensing and Attribution
This dataset aggregates source code from repositories with diverse licenses. Any use of code or data derived from this dataset must comply with the original repository licenses, including attribution requirements where applicable.
The `license` field in each record indicates the license of the source repository. Users are responsible for:
- Reviewing applicable license terms
- Providing proper attribution when required
- Ensuring compliance with license restrictions
---
## Technical Details
- **Source**: Public repositories hosted on [JihuLab](https://jihulab.com)
- **Annotations**: Machine-generated (language detection, license identification)
- **Multilingual Support**: Includes multilingual code and documentation with emphasis on Chinese content
- **Task Categories**: Text generation, code modeling, language understanding
- **Tags**: Code, Chinese language, multilingual, enterprise development
---
Provider: Ujjwal-Tyagi



