Ujjwal-Tyagi/gitee

Name: Ujjwal-Tyagi/gitee
Creator: Ujjwal-Tyagi
Published: 2026-03-30 11:44:04
License: No description provided

Source: Hugging Face. Updated 2026-03-30; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Ujjwal-Tyagi/gitee



Description:

---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- code
- zh
- en
license: other
multilinguality:
- multilingual
pretty_name: Gitee Code Dataset
size_categories:
- 100M<n<1B
source_datasets:
- original
task_categories:
- text-generation
tags:
- code
- chinese
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.parquet"
  default: true
dataset_info:
  features:
  - name: code
    dtype: string
  - name: repo_name
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: size
    dtype: int64
---

# Gitee Code Dataset

A comprehensive, large-scale code dataset compiled from [Gitee](https://gitee.com), China's premier code hosting platform. This dataset is specifically designed to support training code models with strong Chinese language understanding and authentic Chinese coding practices.

---

## Overview

The Gitee Code Dataset represents one of the largest code corpora from a Chinese-native platform, capturing both open-source and enterprise projects across 554 programming languages. It serves as a critical resource for developing multilingual code understanding models tailored to Chinese developers and organizations.

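
The `configs` entry in the front matter declares a single `default` config whose only split, `train`, covers `data/*.parquet`. A minimal sketch of how such a layout might be consumed with the Hugging Face `datasets` library follows. The repository id is taken from the download link on this page; the helper names `stream_records` and `select` are hypothetical, and streaming mode is a choice made here rather than something the card prescribes.

```python
from typing import Any, Dict, Iterable, Iterator, Set

# Repository id as it appears in the download link on this page.
DATASET_ID = "Ujjwal-Tyagi/gitee"


def stream_records() -> Iterator[Dict[str, Any]]:
    """Lazily iterate over the `train` split instead of downloading it all."""
    # Imported inside the function so that `select` below stays usable
    # even when the `datasets` package is not installed.
    from datasets import load_dataset

    return iter(load_dataset(DATASET_ID, split="train", streaming=True))


def select(
    records: Iterable[Dict[str, Any]],
    language: str,
    licenses: Set[str],
) -> Iterator[Dict[str, Any]]:
    """Yield records for one language whose license is in an allow-list.

    Relies only on the `language` and `license` features declared in the
    front matter; the exact license strings to pass are up to the caller.
    """
    for record in records:
        if record["language"] == language and record["license"] in licenses:
            yield record
```

For instance, `select(stream_records(), "Python", {"apache-2.0"})` would lazily yield Python files from repositories tagged `apache-2.0`; the license id is only illustrative.
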
### Key Statistics

| Metric | Value |
|--------|-------|
| Total Files | 819,472,785 |
| Total Repositories | 3,105,923 |
| Compressed Size | 536 GB (Parquet with Zstd) |
| Programming Languages | 554 |
| Total Parquet Shards | 468 files |

---

## Dataset Characteristics

### Scope and Coverage

This dataset captures code from over 3 million repositories hosted on Gitee, including:

- **Chinese-centric content**: Extensive coverage of code written by Chinese developers, featuring Chinese comments, documentation, and variable naming conventions
- **Diverse language ecosystem**: Support for 554 distinct programming languages
- **Enterprise and open-source projects**: A balanced mix from individual developers to major Chinese enterprises
- **Quality-assured**: Rigorously filtered to exclude vendor code, build artifacts, generated files, and low-quality content

### Programming Languages

The dataset encompasses 554 languages across multiple categories. The 30 most represented languages by file count are:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | Java | 293,439,777 |
| 2 | JavaScript | 77,715,425 |
| 3 | C | 62,836,721 |
| 4 | C++ | 49,134,251 |
| 5 | HTML | 46,191,063 |
| 6 | Vue | 40,468,646 |
| 7 | PHP | 37,132,954 |
| 8 | C# | 33,842,369 |
| 9 | Python | 25,192,704 |
| 10 | CSS | 20,802,464 |
| 11 | TypeScript | 20,122,528 |
| 12 | Go | 16,176,561 |
| 13 | Shell | 8,371,429 |
| 14 | Makefile | 6,341,964 |
| 15 | Java Server Pages | 6,224,523 |
| 16 | TSX | 5,768,542 |
| 17 | CMake | 5,581,774 |
| 18 | SCSS | 5,291,031 |
| 19 | Objective-C | 4,922,736 |
| 20 | Less | 4,669,672 |
| 21 | Ruby | 3,027,385 |
| 22 | Kotlin | 2,986,211 |
| 23 | Scala | 2,869,640 |
| 24 | Rust | 2,466,122 |
| 25 | Starlark | 2,027,514 |
| 26 | Dart | 2,010,079 |
| 27 | Unix Assembly | 1,900,320 |
| 28 | Fluent | 1,882,380 |
| 29 | HTML+Razor | 1,863,914 |
| 30 | Swift | 1,607,477 |

### License Distribution

Files are distributed across various open-source licenses.
Repositories with restrictive terms (CC-BY-ND, Commons Clause, SSPL) have been excluded to ensure broader usability.

| License | File Count |
|---------|------------|
| Apache 2.0 | 273,706,950 |
| MIT | 201,880,040 |
| Unknown | 195,868,240 |
| AGPL 3.0 | 60,181,320 |
| BSD | 30,013,190 |
| GPL 2.0 | 27,831,530 |
| LGPL 3.0 | 11,746,750 |
| LGPL 2.1 | 4,807,600 |
| BSD-3-Clause | 4,442,480 |
| CC0 1.0 | 3,144,920 |
| GPL 3.0 | 1,631,590 |
| Unlicense | 1,181,930 |
| BSD-2-Clause | 1,154,300 |
| EPL 1.0 | 1,045,470 |
| Other Licenses | ~5,800,000 |

---

## Dataset Structure

### Data Fields

Each record contains six fields providing metadata and content information:

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | The complete source code content in UTF-8 encoding |
| `repo_name` | string | Repository identifier in the format `username/repository_name` |
| `path` | string | File path relative to the repository root |
| `language` | string | Programming language identified using [go-enry](https://github.com/go-enry/go-enry) |
| `license` | string | Repository license (SPDX identifier or "unknown") |
| `size` | int64 | File size in bytes |

### Sample Record

```json
{
  "code": "package com.example.demo;\n\nimport org.springframework.boot.SpringApplication;\n...",
  "repo_name": "username/spring-demo",
  "path": "src/main/java/com/example/demo/Application.java",
  "language": "Java",
  "license": "apache-2.0",
  "size": 1234
}
```

### File Format

- **Format**: Apache Parquet with Zstd compression
- **Shards**: 468 files (`gitee_0000.parquet` through `gitee_0467.parquet`)
- **Split**: All examples are included in a single training split (no validation or test splits)

---

## Data Creation Process

### Pipeline Stages

The dataset was constructed through a multi-stage pipeline:

1. **Repository Discovery** – Identification of relevant repositories on Gitee
2. **Branch Selection** – Extraction of the primary branch using priority order: `master` → `main` → `develop` → `dev` → first available branch
3. **Repository Cloning** – Download of selected repositories
4. **Content Extraction and Filtering** – Extraction and quality filtering of source code files
5. **Parquet Serialization** – Writing processed records to compressed Parquet shards

### Language Detection

Programming languages are identified using [go-enry](https://github.com/go-enry/go-enry), a Go implementation of GitHub's Linguist classification system. Only files classified as **Programming** or **Markup** types are retained; Data and Prose file types are excluded.

### License Detection

License identification follows a three-step process:

1. Scan for license files: `LICENSE`, `LICENSE.txt`, `LICENSE.md`, `COPYING`, and similar variants
2. Match license text against known patterns (MIT, Apache 2.0, GPL variants, BSD, Creative Commons, etc.)
3. Default to "unknown" if no license match is found

**Excluded Licenses**: The following restrictive licenses are filtered out to ensure broad usability:

- Creative Commons No-Derivatives: `cc-by-nd`, `cc-by-nd-2.0`, `cc-by-nd-3.0`, `cc-by-nd-4.0`
- `commons-clause`
- Server Side Public License: `sspl`, `sspl-1.0`

### Quality Filtering

The following filters are applied to improve dataset quality and usability:

#### Size Constraints

| Constraint | Limit |
|-----------|-------|
| Maximum repository compressed size | 48 MB |
| Maximum single file size | 1 MB |
| Maximum line length | 1,000 characters |

#### Excluded Directories

**Version Control and IDE Configuration**

- `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`

**Dependencies and Vendor Code**

- `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`

**Build Artifacts and Output**

- `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`

#### Excluded Files

**Dependency Lock Files**

- `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`

**Minified Code**

- Any file containing `.min.` in the filename

**Binary and Non-Code Files**

- Executables: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`
- Java archives and compiled bytecode: `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`
- Documents: `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`
- Archives: `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`
- Media: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`
- Fonts: `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`

**System Files**

- `.DS_Store`, `thumbs.db`

#### Content Validation

Files must meet the following criteria to be included:

- **Text Encoding**: Valid UTF-8 encoding required
- **Binary Detection**: Files identified as binary by go-enry are excluded
- **Auto-generation Markers**: Files with generation indicators in the first 500 bytes are filtered out:
  - Markers: `generated by`, `do not edit`, `auto-generated`, `autogenerated`, `automatically generated`, `code generator`, `generated code`, `this file is generated`, `@generated`, `<auto-generated`
- **Content Quality**: Empty files or those containing only whitespace are excluded
- **Line Length**: Files with any line exceeding 1,000 characters are excluded
- **Advanced Filtering**: Additional go-enry checks exclude vendor code, images, dotfiles, test files, and detected generated code
- **Repository Type**: Repositories containing only documentation are skipped

---

## Usage Considerations

### Data Privacy and Security

The dataset may contain sensitive information that requires careful handling:

- **Email Addresses**: Present in code comments, documentation, or configuration files
- **Credentials**: Accidentally committed API keys or authentication tokens
- **Personal Information**: Names, phone numbers, and other identifiable data in comments or documentation

Users should implement appropriate filtering and anonymization when preparing data for model training.

### Licensing and Attribution

This dataset aggregates source code from repositories with diverse licenses. Any use of code or data derived from this dataset must comply with the original repository licenses, including attribution requirements where applicable. The `license` field in each record indicates the license of the source repository.

Users are responsible for:

- Reviewing applicable license terms
- Providing proper attribution when required
- Ensuring compliance with license restrictions

---

## Technical Details

**Source**: Public repositories hosted on [Gitee](https://gitee.com)

**Annotations**: Machine-generated (language detection, license identification)

**Multilingual Support**: Includes multilingual code and documentation

**Task Categories**: Text generation, code modeling, language understanding

**Tags**: Code, Chinese language, multilingual

---
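
The path-exclusion and content-validation rules listed under Quality Filtering can be approximated in a few lines of Python. This is a minimal sketch rather than the pipeline's actual code: the function names are hypothetical, the directory and lock-file sets are deliberately small subsets of the lists above, matching markers case-insensitively is an assumption, and "1 MB" is interpreted here as 1,000,000 bytes. The go-enry based checks (binary, vendor, test-file detection) are not reproduced.

```python
# Subsets of the excluded directories and lock files listed above.
EXCLUDED_DIRS = {".git", "node_modules", "vendor", "build", "dist", "target", "__pycache__"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "Cargo.lock", "go.sum"}

# Auto-generation markers, copied from the Content Validation list.
GENERATED_MARKERS = (
    "generated by", "do not edit", "auto-generated", "autogenerated",
    "automatically generated", "code generator", "generated code",
    "this file is generated", "@generated", "<auto-generated",
)

MAX_FILE_BYTES = 1_000_000  # assumption: "1 MB" read as 10**6 bytes
MAX_LINE_CHARS = 1_000
MARKER_WINDOW = 500         # only the first 500 bytes are scanned


def is_excluded_path(path: str) -> bool:
    """Return True if a repository-relative path hits an exclusion rule."""
    *dirs, name = path.split("/")
    if any(d in EXCLUDED_DIRS for d in dirs):
        return True
    return name in LOCK_FILES or ".min." in name


def passes_content_checks(raw: bytes) -> bool:
    """Apply the size, encoding, emptiness, marker and line-length checks."""
    if len(raw) > MAX_FILE_BYTES:
        return False
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8
    if not text.strip():
        return False  # empty or whitespace-only
    # errors="ignore" tolerates a multi-byte character cut at the window edge.
    head = raw[:MARKER_WINDOW].decode("utf-8", errors="ignore").lower()
    if any(marker in head for marker in GENERATED_MARKERS):
        return False
    return all(len(line) <= MAX_LINE_CHARS for line in text.splitlines())
```

Checking the path first avoids reading file contents for anything under an excluded directory, which is the cheaper test on large repositories.
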

