
Ujjwal-Tyagi/notabug

Source: Hugging Face. Updated 2026-03-30; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Ujjwal-Tyagi/notabug
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- code
- en
license: other
multilinguality:
- multilingual
pretty_name: NotaBug Code Dataset
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- text-generation
tags:
- code
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.parquet"
  default: true
dataset_info:
  features:
  - name: code
    dtype: string
  - name: repo_name
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: size
    dtype: int64
---

# NotaBug Code Dataset

A comprehensive code dataset compiled from [NotaBug.org](https://notabug.org), a free code hosting platform that emphasizes software freedom, privacy, and fully open-source infrastructure. The dataset is designed to support training code models on code written by freedom-conscious developers and the free software community.

---

## Overview

The NotaBug Code Dataset is a large code corpus from a platform dedicated to software freedom and developer privacy. The platform runs on a fully free software stack, and the dataset captures diverse open-source projects spanning thousands of file types. It is intended as a resource for developing code models that reflect free software practices and principles.

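
The card metadata declares a single default config with one `train` split over `data/*.parquet`. The following is a minimal sketch of reading it lazily, assuming the Hugging Face `datasets` library is installed and that the repository ID `Ujjwal-Tyagi/notabug` (taken from the download link) resolves on the Hub. The helper names `take_by_language` and `stream_notabug` are illustrative and not part of the dataset; the field names come from the `dataset_info` metadata.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def take_by_language(records: Iterable[Record], language: str, limit: int) -> List[Record]:
    """Return at most `limit` records whose `language` field equals `language`."""
    matches = (r for r in records if r.get("language") == language)
    return list(islice(matches, limit))


def stream_notabug() -> Iterable[Record]:
    """Open the train split lazily.

    Assumes `pip install datasets` and network access; nothing is
    downloaded until the returned iterable is consumed.
    """
    from datasets import load_dataset  # third-party; imported lazily on purpose

    return load_dataset("Ujjwal-Tyagi/notabug", split="train", streaming=True)
```

For example, `take_by_language(stream_notabug(), "Python", limit=3)` would pull the first three Python records without fetching the whole corpus. Streaming is used here only because the dataset is large; a plain non-streaming load should work as well if disk space allows.
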
### Key Statistics

| Metric | Value |
|--------|-------|
| Total Files | 12,622,961 |
| Total Repositories | 11,660 |
| Compressed Size | 12 GB (Parquet with Zstd) |
| File Types/Languages | 6,306 (by file extension) |
| File Format | 12 Parquet files |

---

## Dataset Characteristics

### Scope and Coverage

This dataset captures code from over 11,600 repositories hosted on NotaBug.org, including:

- **Free software focused community**: Extensive coverage of code from developers committed to software freedom and open-source principles
- **Diverse language ecosystem**: Support for 6,306 distinct file types identified by file extension
- **Rich metadata**: Repository names, file paths, license information, and file sizes
- **Free software stack**: Code hosted on a platform built entirely on free and open-source software
- **Quality-assured**: Filtered to exclude files with excessively long lines

### File Types and Languages

The dataset encompasses files across 6,306 different types. The 30 most represented file types by file count are:

| Rank | File Type | File Count |
|------|-----------|------------|
| 1 | C++ | 2,219,208 |
| 2 | po | 2,022,441 |
| 3 | none | 1,572,451 |
| 4 | PHP | 951,354 |
| 5 | patch | 637,317 |
| 6 | svg | 547,170 |
| 7 | XML | 502,139 |
| 8 | Python | 392,476 |
| 9 | Text | 296,953 |
| 10 | JavaScript | 233,368 |
| 11 | JSON | 198,981 |
| 12 | Scheme | 192,409 |
| 13 | Markdown | 182,342 |
| 14 | info | 155,078 |
| 15 | slackbuild | 154,859 |
| 16 | HTML | 149,824 |
| 17 | Shell | 133,325 |
| 18 | log | 127,393 |
| 19 | Makefile | 112,989 |
| 20 | INI | 110,537 |
| 21 | Lua | 84,303 |
| 22 | in | 75,138 |
| 23 | Assembly | 74,519 |
| 24 | list | 58,346 |
| 25 | Java | 48,781 |
| 26 | CSS | 48,112 |
| 27 | mk | 47,373 |
| 28 | dtsi | 43,825 |
| 29 | diff | 42,125 |
| 30 | el | 41,017 |

### License Distribution

Files are distributed across various licenses, with MIT being predominant:

| License | File Count |
|---------|------------|
| MIT | 10,029,349 |
| MPL 2.0 | 1,178,420 |
| Unknown | 888,840 |
| GPL 2.0 | 333,538 |
| GPL 3.0 | 158,975 |
| Unlicense | 11,805 |
| CC-BY 4.0 | 8,367 |
| BSD-2-Clause | 4,718 |
| AGPL 3.0 | 3,055 |
| CC-BY-SA 4.0 | 2,309 |
| WTFPL | 1,314 |
| CC0 1.0 | 1,188 |
| BSD-3-Clause | 601 |
| CC-BY-NC 4.0 | 269 |
| LGPL 3.0 | 137 |
| LGPL 2.1 | 76 |

---

## Dataset Structure

### Data Fields

Each record contains six fields providing metadata and content information:

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | The complete source file content in UTF-8 encoding |
| `repo_name` | string | Repository identifier in the format `username/repository_name` |
| `path` | string | File path relative to the repository root |
| `language` | string | File type/language inferred from file extension |
| `license` | string | Repository license (SPDX identifier or "unknown") |
| `size` | int64 | File size in bytes |

### Sample Record

```json
{
  "code": "#!/usr/bin/env python2\n# -*- coding: utf-8 -*-\n# Copyright (C) 2014...",
  "repo_name": "intermsofthewhole/libreboot",
  "path": "resources/utilities/i945gpu/intel-regs.py",
  "language": "Python",
  "license": "mit",
  "size": 3733
}
```

### File Format

- **Format**: Apache Parquet with Zstd compression
- **Structure**: 12 consolidated files (`notabug_0000.parquet` to `notabug_0011.parquet`)
- **Encoding**: UTF-8
- **Split**: All examples are included in a single training split (no validation or test splits)

---

## Data Creation Process

### Language Detection Methodology

Programming languages and file types are identified by file extension inference. This approach captures the intended purpose of each file while maintaining broad language coverage across the platform.

### Source Data

All data originates from public repositories hosted on [NotaBug.org](https://notabug.org), a platform built on fully free and open-source software infrastructure.

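
The extension-based inference described under Language Detection Methodology can be sketched as follows. This is an illustration only: the dataset's actual mapping is not published in this card, so the `KNOWN_EXTENSIONS` table, the function name `infer_language`, and the fallback rules (bare extension for unmapped types, `"none"` for files without an extension) are assumptions inferred from labels such as `po`, `svg`, and `none` in the file-type table.

```python
import posixpath

# Hypothetical extension-to-label table; the dataset's real mapping is not published.
KNOWN_EXTENSIONS = {
    ".py": "Python",
    ".cpp": "C++",
    ".php": "PHP",
    ".js": "JavaScript",
    ".md": "Markdown",
}


def infer_language(path: str) -> str:
    """Guess a file-type label from the extension of a repository-relative path."""
    # Repository paths use forward slashes, so posixpath is used on every OS.
    _, ext = posixpath.splitext(posixpath.basename(path))
    if len(ext) < 2:
        # No extension (or a bare trailing dot).
        # Assumption: mirrors the "none" row in the file-type table.
        return "none"
    # Assumption: unmapped extensions are labeled with the bare, lowercased
    # extension, as rows like "po" and "svg" suggest.
    return KNOWN_EXTENSIONS.get(ext.lower(), ext[1:].lower())
```

Note that the file-type table also contains labels such as `Makefile`, which suggests the real pipeline applies rules beyond this sketch; here a file named `Makefile` would simply be labeled `none`.
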
### License Detection

License identification follows a systematic approach:

1. Scan repositories for license files and declarations
2. Match license text against known license patterns
3. Default to "unknown" if no license can be detected

### Quality Filtering

The dataset has undergone systematic filtering to ensure quality:

#### Line Length Constraints

- Files with any line exceeding 1,000 characters are excluded
- This ensures compatibility with text processing pipelines

#### Deduplication Policy

- No deduplication was performed on the dataset
- All files from public repositories are included as-is

#### Content Inclusion

- UTF-8 encoded files are retained
- All public repository content is preserved

---

## Usage Considerations

### Data Privacy and Security

The dataset may contain sensitive information that requires careful handling:

- **Email Addresses**: Present in code comments, documentation, or configuration files
- **Credentials**: Accidentally committed API keys or authentication tokens
- **Personal Information**: Names, phone numbers, and other identifiable data in comments or documentation

Users should implement appropriate filtering and anonymization when preparing data for model training.

### Licensing and Attribution

This dataset aggregates source code from repositories with diverse licenses. Any use of code or data derived from this dataset must comply with the original repository licenses, including attribution requirements where applicable. The `license` field in each record indicates the license of the source repository.

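
One way to act on the `license` field is a small preprocessing pass. The sketch below keeps only records whose license is in a caller-chosen allowlist and re-applies the stated line-length rule, which is useful mainly if a stricter limit is wanted. The names `filter_records`, `has_long_line`, and `ALLOWED_LICENSES` are hypothetical, and the allowlist values assume lowercase identifiers like the `"mit"` seen in the sample record; check the values actually present in the data before relying on it.

```python
from typing import Any, Dict, Iterable, Iterator, Set

Record = Dict[str, Any]

MAX_LINE_LENGTH = 1_000  # threshold stated under "Line Length Constraints"

# Hypothetical allowlist. Assumes lowercase values, as in the sample record ("mit").
ALLOWED_LICENSES: Set[str] = {"mit", "unlicense"}


def has_long_line(code: str, limit: int = MAX_LINE_LENGTH) -> bool:
    """Return True if any line of `code` is longer than `limit` characters."""
    return any(len(line) > limit for line in code.splitlines())


def filter_records(
    records: Iterable[Record],
    allowed: Set[str] = ALLOWED_LICENSES,
    limit: int = MAX_LINE_LENGTH,
) -> Iterator[Record]:
    """Yield records with an allowed license and no over-long lines."""
    for record in records:
        # Normalize case and whitespace; a missing field counts as "unknown".
        license_id = str(record.get("license", "unknown")).strip().lower()
        if license_id not in allowed:
            continue
        if has_long_line(record.get("code", ""), limit):
            continue
        yield record
```

Because the function is a generator, it can be chained onto a streamed split without materializing the corpus in memory. Passing an allowlist does not by itself satisfy attribution or other license obligations.
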
Users are responsible for:

- Reviewing applicable license terms
- Providing proper attribution when required
- Ensuring compliance with license restrictions
- Respecting the software freedom principles of the original projects

### Free Software Principles

Users of this dataset should be mindful of the free software values of the NotaBug.org community:

- Respect developer privacy and freedom
- Support open-source software development
- Consider contributing improvements back to the community
- Use the dataset ethically and transparently

---

## Technical Details

**Source**: Public repositories hosted on [NotaBug.org](https://notabug.org)

**Annotations**: Machine-generated (file type detection, license identification)

**Language Detection**: File extension-based inference

**Task Categories**: Text generation, code modeling, language understanding

**Tags**: Code, free software, open-source, software freedom

**Platform Stack**: Built on fully free and open-source software (GNU/Linux, a fork of Gogs, etc.)

---
Provider:
Ujjwal-Tyagi