Ujjwal-Tyagi/jihulab

Name: Ujjwal-Tyagi/jihulab
Creator: Ujjwal-Tyagi
Published: 2026-03-30 11:51:27
License: No description available

Hugging Face, updated 2026-03-30, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Ujjwal-Tyagi/jihulab

Resource description:

---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- code
- zh
- en
license: other
multilinguality:
- multilingual
pretty_name: JihuLab Code Dataset
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
tags:
- code
- chinese
configs:
- config_name: default
  data_files:
  - split: train
    path: "Data/*.parquet"
  default: true
dataset_info:
  features:
  - name: code
    dtype: string
  - name: repo_name
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: size
    dtype: int64
---

# JihuLab Code Dataset

A comprehensive code dataset compiled from [JihuLab](https://jihulab.com), GitLab's official Chinese distribution and the primary GitLab instance for Chinese developers and enterprises. This dataset is specifically designed to support training code models with strong Chinese language understanding and enterprise-level coding practices.

---

## Overview

The JihuLab Code Dataset represents a significant code corpus from China's official GitLab platform, capturing both open-source and enterprise projects across 304 programming languages. It serves as a valuable resource for developing multilingual code understanding models tailored to Chinese developers and organizations, with compliance to Chinese regulatory requirements.
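
The front matter declares six features under `dataset_info`. As a minimal sketch (not part of the dataset's own tooling), the check below compares a record against that declared schema. The function name `schema_problems` and the mapping of `string`/`int64` to Python's `str`/`int` are assumptions made for illustration.

```python
from typing import Any, List, Mapping

# Field names and dtypes as declared under `dataset_info.features`.
# Mapping `string` -> str and `int64` -> int is an assumption for this sketch.
EXPECTED_FIELDS = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def schema_problems(record: Mapping[str, Any]) -> List[str]:
    """Return a list of schema mismatches; an empty list means the record conforms."""
    problems: List[str] = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    for name in sorted(set(record) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems
```

Records could, for example, be streamed with the Hugging Face `datasets` library via `load_dataset("Ujjwal-Tyagi/jihulab", split="train", streaming=True)` and passed to `schema_problems` one at a time; that call needs network access and is not exercised here.
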
### Key Statistics

| Metric | Value |
|--------|-------|
| Total Files | 1,853,253 |
| Total Repositories | 11,589 |
| Compressed Size | 1.5 GB (Parquet with Zstd) |
| Uncompressed Size | 12.76 GB |
| Programming Languages | 304 |
| File Format | Single Parquet file |

---

## Dataset Characteristics

### Scope and Coverage

This dataset captures code from over 11,500 repositories hosted on JihuLab, including:

- **Chinese enterprise ecosystem**: Extensive coverage of code from Chinese developers and enterprises, featuring Chinese comments, documentation, and variable naming conventions
- **Diverse language ecosystem**: Support for 304 distinct programming languages
- **Developer and enterprise projects**: A comprehensive mix of individual developer projects and enterprise-grade codebases
- **GitLab-native practices**: Code following GitLab workflows and best practices
- **Quality-assured**: Rigorously filtered to exclude vendor code, build artifacts, generated files, and low-quality content

### Programming Languages

The dataset encompasses 304 languages. The 30 most represented languages by file count are:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 348,517 |
| 2 | C | 209,924 |
| 3 | JavaScript | 191,164 |
| 4 | Python | 172,798 |
| 5 | C++ | 136,046 |
| 6 | Go | 80,000 |
| 7 | TypeScript | 79,067 |
| 8 | HTML | 69,173 |
| 9 | C# | 64,511 |
| 10 | Rust | 50,515 |
| 11 | Shell | 43,352 |
| 12 | Vue | 40,687 |
| 13 | TSX | 36,844 |
| 14 | CSS | 34,779 |
| 15 | Makefile | 26,227 |
| 16 | Ruby | 25,812 |
| 17 | PHP | 21,401 |
| 18 | CMake | 15,292 |
| 19 | Kotlin | 14,220 |
| 20 | BitBake | 13,060 |
| 21 | SCSS | 10,957 |
| 22 | Scala | 9,333 |
| 23 | Dart | 9,125 |
| 24 | Lua | 7,413 |
| 25 | ASP.NET | 7,005 |
| 26 | Vim Script | 5,710 |
| 27 | Unix Assembly | 5,239 |
| 28 | Starlark | 5,134 |
| 29 | Objective-C | 4,931 |
| 30 | Factor | 4,920 |

### License Distribution

Files are distributed across various open-source licenses.
Repositories with restrictive terms (CC-BY-ND, Commons Clause, SSPL) have been excluded to ensure broader usability.

| License | File Count |
|---------|------------|
| Apache 2.0 | 551,008 |
| Unknown | 535,320 |
| MIT | 320,834 |
| AGPL 3.0 | 169,922 |
| GPL 2.0 | 112,829 |
| BSD | 65,104 |
| CC0 1.0 | 13,557 |
| LGPL 3.0 | 12,871 |
| LGPL 2.1 | 9,960 |
| BSD-3-Clause | 9,109 |
| BSL 1.1 | 8,972 |
| EPL 1.0 | 7,494 |
| GPL 3.0 | 7,476 |
| Unlicense | 6,265 |
| CC-BY 3.0 | 4,717 |
| CC-BY-NC-SA | 4,339 |
| MPL 2.0 | 3,847 |
| CC-BY 4.0 | 2,459 |
| CC-BY-NC-SA 4.0 | 1,715 |
| CC-BY-SA 4.0 | 1,701 |
| BSD-2-Clause | 1,599 |
| CC-BY-NC-ND 4.0 | 1,222 |
| ISC | 520 |
| WTFPL | 274 |
| CC-BY-NC 4.0 | 122 |
| CC-BY-SA | 13 |
| CC-BY-SA 3.0 | 4 |

---

## Dataset Structure

### Data Fields

Each record contains six fields providing metadata and content information:

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | The complete source code content in UTF-8 encoding |
| `repo_name` | string | Repository identifier in the format `username/repo` or `group/subgroup/repo` |
| `path` | string | File path relative to the repository root |
| `language` | string | Programming language identified using [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | Repository license (SPDX identifier or "unknown") |
| `size` | int64 | File size in bytes |

### Sample Record

```json
{
  "code": "package com.example.demo;\n\nimport org.springframework.boot.*;\nimport org.springframework.boot.autoconfigure.*;\n...",
  "repo_name": "SmallQ/demo",
  "path": "src/main/java/com/example/demo/DemoApplication.java",
  "language": "Java",
  "license": "unknown",
  "size": 400
}
```

### File Format

- **Format**: Apache Parquet with Zstd compression
- **Structure**: Single consolidated file (`data.parquet`)
- **Encoding**: UTF-8
- **Split**: All examples are included in a single training split (no validation or test splits)

---

## Data Creation Process

### Pipeline Stages

The dataset was constructed through a multi-stage pipeline:

1. **Repository Discovery** – Enumeration of public repositories using JihuLab's GitLab API (`/api/v4/projects`) with paginated requests
2. **Branch Selection** – Extraction of the repository's default branch, typically `main` or `master`
3. **Repository Cloning** – Download of repository archives via JihuLab's archive endpoint
4. **Content Extraction and Filtering** – Extraction and quality filtering of source code files
5. **Parquet Serialization** – Writing processed records to compressed Parquet format

### Language Detection

Programming languages are identified using [go-enry](https://github.com/go-enry/go-enry), a Go implementation of GitHub's Linguist classification system. Only files classified as **Programming** or **Markup** types are retained; Data and Prose file types are excluded.

### License Detection

License identification follows a three-step process:

1. Scan for license files: `LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, and similar variants
2. Match license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.)
3. Default to "unknown" if no license match is found

**Excluded Licenses**: The following restrictive licenses are filtered out to ensure broad usability:

- Creative Commons No-Derivatives: `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0`
- `commons-clause`
- Server Side Public License: `sspl`, `sspl-1.0`

### Quality Filtering

The following filters are applied to ensure dataset quality and usability:

#### Size Constraints

| Constraint | Limit |
|-----------|-------|
| Maximum repository compressed size | 48 MB |
| Maximum single file size | 1 MB |
| Maximum line length | 1,000 characters |

#### Excluded Directories

**Version Control and IDE Configuration**

- `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`

**Dependencies and Vendor Code**

- `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`

**Build Artifacts and Output**

- `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`

#### Excluded Files

**Dependency Lock Files**

- `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`

**Minified Code**

- Any file containing `.min.` in the filename

**Binary and Non-Code Files**

- Executables: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`
- Compiled and bytecode artifacts: `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`
- Documents: `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`
- Archives: `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`
- Media: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`
- Fonts: `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`

**System Files**

- `.DS_Store`, `thumbs.db`

#### Content Validation

Files must meet the following criteria to be included:

- **Text Encoding**: Valid UTF-8 encoding required
- **Binary Detection**: Files identified as binary by go-enry are excluded
- **Auto-generation Markers**: Files with generation indicators in the first 500 bytes are filtered out:
  - Markers: `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `code generator`, `generated code`, `this file is generated`, `@generated`, `<auto-generated`
- **Content Quality**: Empty files or those containing only whitespace are excluded
- **Line Length**: Files with any line exceeding 1,000 characters are excluded
- **Advanced Filtering**: Additional go-enry checks exclude vendor code, images, dotfiles, test files, and detected generated code
- **Repository Type**: Repositories containing only documentation are skipped

---

## Usage Considerations

### Data Privacy and Security

The dataset may contain sensitive information that requires careful handling:

- **Email Addresses**: Present in code comments, documentation, or configuration files
- **Credentials**: Accidentally committed API keys or authentication tokens
- **Personal Information**: Names, phone numbers, and other identifiable data in comments or documentation

Users should implement appropriate filtering and anonymization when preparing data for model training.

### Licensing and Attribution

This dataset aggregates source code from repositories with diverse licenses. Any use of code or data derived from this dataset must comply with the original repository licenses, including attribution requirements where applicable. The `license` field in each record indicates the license of the source repository.
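
Because each record carries its repository license, a downstream user might restrict training data to an allowlist of licenses. The sketch below is one way to do that, under stated assumptions: the helper name `keep_allowed_licenses` is hypothetical, and the allowlist strings are illustrative guesses at lowercase SPDX-style identifiers. The exact values stored in the `license` field should be checked against the data before relying on this.

```python
from typing import Any, FrozenSet, Iterable, Iterator, Mapping

# Illustrative allowlist only; the exact strings used in the `license`
# field are an assumption and should be verified against the data.
ALLOWED_LICENSES: FrozenSet[str] = frozenset(
    {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense", "cc0-1.0"}
)


def keep_allowed_licenses(
    records: Iterable[Mapping[str, Any]],
    allowed: FrozenSet[str] = ALLOWED_LICENSES,
) -> Iterator[Mapping[str, Any]]:
    """Yield only records whose `license` value is in the allowlist.

    Comparison is case-insensitive; records with a missing or "unknown"
    license are dropped because "unknown" is not in the allowlist.
    """
    for record in records:
        license_id = str(record.get("license", "unknown")).strip().lower()
        if license_id in allowed:
            yield record
```
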
Users are responsible for:

- Reviewing applicable license terms
- Providing proper attribution when required
- Ensuring compliance with license restrictions

---

## Technical Details

**Source**: Public repositories hosted on [JihuLab](https://jihulab.com)

**Annotations**: Machine-generated (language detection, license identification)

**Multilingual Support**: Includes multilingual code and documentation with emphasis on Chinese content

**Task Categories**: Text generation, code modeling, language understanding

**Tags**: Code, Chinese language, multilingual, enterprise development

---
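
For readers who want to apply similar checks to their own data, the Content Validation rules and file-level Size Constraints listed under Quality Filtering can be approximated in a few lines. This is a sketch under assumptions, not the pipeline's actual code: the function name `passes_content_validation` is hypothetical, marker matching is assumed to be case-insensitive, "1 MB" is assumed to mean 1024 × 1024 bytes, and the go-enry checks (binary, vendor, and generated-code detection) are omitted.

```python
# Markers copied from the Content Validation list above.
GENERATED_MARKERS = (
    "generated by", "do not edit", "auto-generated", "autogenerated",
    "automatically generated", "code generator", "generated code",
    "this file is generated", "@generated", "<auto-generated",
)
MAX_FILE_BYTES = 1024 * 1024  # "1 MB"; the exact byte count is an assumption
MAX_LINE_CHARS = 1_000        # maximum line length from the size table
HEADER_BYTES = 500            # only the first 500 bytes are scanned for markers


def passes_content_validation(raw: bytes) -> bool:
    """Return True if a file's raw bytes pass the sketched checks."""
    if len(raw) > MAX_FILE_BYTES:
        return False
    try:
        text = raw.decode("utf-8")  # must be valid UTF-8
    except UnicodeDecodeError:
        return False
    if not text.strip():  # empty or whitespace-only
        return False
    # Case-insensitive marker search in the file header (an assumption).
    header = raw[:HEADER_BYTES].decode("utf-8", errors="ignore").lower()
    if any(marker in header for marker in GENERATED_MARKERS):
        return False
    # Reject files with any line longer than the limit.
    return all(len(line) <= MAX_LINE_CHARS for line in text.splitlines())
```

Scanning only the header keeps the marker check cheap and avoids rejecting files that merely mention code generation deep in their body.
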

Provider:

Ujjwal-Tyagi
