Ujjwal-Tyagi/notabug
Hugging Face · Updated 2026-03-30 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Ujjwal-Tyagi/notabug
Description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- code
- en
license: other
multilinguality:
- multilingual
pretty_name: NotaBug Code Dataset
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- text-generation
tags:
- code
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.parquet"
  default: true
dataset_info:
  features:
  - name: code
    dtype: string
  - name: repo_name
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: size
    dtype: int64
---
# NotaBug Code Dataset
A comprehensive code dataset compiled from [NotaBug.org](https://notabug.org), a free code hosting platform that emphasizes software freedom, privacy, and fully open-source infrastructure. This dataset is specifically designed to support training code models using code from freedom-conscious developers and the free software community.
---
## Overview
The NotaBug Code Dataset is a code corpus collected from NotaBug.org, a platform dedicated to software freedom and developer privacy that runs on a fully free software stack. It covers open-source projects from 11,660 repositories across 6,306 file types, identified by file extension. The dataset is intended as a resource for developing code models that reflect free software practices and principles.
### Key Statistics
| Metric | Value |
|--------|-------|
| Total Files | 12,622,961 |
| Total Repositories | 11,660 |
| Compressed Size | 12 GB (Parquet with Zstd) |
| File Types/Languages | 6,306 (by file extension) |
| File Format | 12 Parquet files |
---
## Dataset Characteristics
### Scope and Coverage
This dataset captures code from over 11,600 repositories hosted on NotaBug.org, including:
- **Free-software-focused community**: Code from developers committed to software freedom and open-source principles
- **Diverse language ecosystem**: 6,306 distinct file types, identified by file extension
- **Rich metadata**: Repository names, file paths, license information, and file sizes
- **Free software stack**: Code hosted on a platform built entirely on free and open-source software
- **Line-length filtered**: Files containing any line longer than 1,000 characters are excluded
### File Types and Languages
The dataset encompasses files across 6,306 different types. The 30 most represented file types by file count are:
| Rank | File Type | File Count |
|------|-----------|------------|
| 1 | C++ | 2,219,208 |
| 2 | po | 2,022,441 |
| 3 | none | 1,572,451 |
| 4 | PHP | 951,354 |
| 5 | patch | 637,317 |
| 6 | svg | 547,170 |
| 7 | XML | 502,139 |
| 8 | Python | 392,476 |
| 9 | Text | 296,953 |
| 10 | JavaScript | 233,368 |
| 11 | JSON | 198,981 |
| 12 | Scheme | 192,409 |
| 13 | Markdown | 182,342 |
| 14 | info | 155,078 |
| 15 | slackbuild | 154,859 |
| 16 | HTML | 149,824 |
| 17 | Shell | 133,325 |
| 18 | log | 127,393 |
| 19 | Makefile | 112,989 |
| 20 | INI | 110,537 |
| 21 | Lua | 84,303 |
| 22 | in | 75,138 |
| 23 | Assembly | 74,519 |
| 24 | list | 58,346 |
| 25 | Java | 48,781 |
| 26 | CSS | 48,112 |
| 27 | mk | 47,373 |
| 28 | dtsi | 43,825 |
| 29 | diff | 42,125 |
| 30 | el | 41,017 |
### License Distribution
Files fall under 15 detected licenses plus an "Unknown" bucket for repositories where no license was detected. MIT accounts for roughly 79% of all files:
| License | File Count |
|---------|------------|
| MIT | 10,029,349 |
| MPL 2.0 | 1,178,420 |
| Unknown | 888,840 |
| GPL 2.0 | 333,538 |
| GPL 3.0 | 158,975 |
| Unlicense | 11,805 |
| CC-BY 4.0 | 8,367 |
| BSD-2-Clause | 4,718 |
| AGPL 3.0 | 3,055 |
| CC-BY-SA 4.0 | 2,309 |
| WTFPL | 1,314 |
| CC0 1.0 | 1,188 |
| BSD-3-Clause | 601 |
| CC-BY-NC 4.0 | 269 |
| LGPL 3.0 | 137 |
| LGPL 2.1 | 76 |
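
The counts in the table can be checked against the Key Statistics and turned into shares with a few lines of Python. This is only a sketch: the numbers are copied from the table above, and the `license_shares` helper is a name introduced here for illustration. The per-license counts add up exactly to the 12,622,961 total files, with MIT at roughly 79%, MPL 2.0 at roughly 9%, and Unknown at roughly 7%.

```python
# File counts copied from the License Distribution table above.
LICENSE_COUNTS = {
    "MIT": 10_029_349,
    "MPL 2.0": 1_178_420,
    "Unknown": 888_840,
    "GPL 2.0": 333_538,
    "GPL 3.0": 158_975,
    "Unlicense": 11_805,
    "CC-BY 4.0": 8_367,
    "BSD-2-Clause": 4_718,
    "AGPL 3.0": 3_055,
    "CC-BY-SA 4.0": 2_309,
    "WTFPL": 1_314,
    "CC0 1.0": 1_188,
    "BSD-3-Clause": 601,
    "CC-BY-NC 4.0": 269,
    "LGPL 3.0": 137,
    "LGPL 2.1": 76,
}

TOTAL_FILES = 12_622_961  # "Total Files" from the Key Statistics table


def license_shares(counts):
    """Return each license's fraction of all files (hypothetical helper)."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = license_shares(LICENSE_COUNTS)

# The per-license counts should account for every file in the dataset.
assert sum(LICENSE_COUNTS.values()) == TOTAL_FILES

for name in ("MIT", "MPL 2.0", "Unknown"):
    print(f"{name}: {shares[name]:.2%}")
```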
---
## Dataset Structure
### Data Fields
Each record contains six fields, one holding the file content and five holding metadata:
| Field | Type | Description |
|-------|------|-------------|
| `code` | string | The complete source file content in UTF-8 encoding |
| `repo_name` | string | Repository identifier in the format `username/repository_name` |
| `path` | string | File path relative to the repository root |
| `language` | string | File type/language inferred from file extension |
| `license` | string | Repository license as an SPDX-style identifier (for example `mit`, as in the sample record) or `unknown` |
| `size` | int64 | File size in bytes |
### Sample Record
```json
{
  "code": "#!/usr/bin/env python2\n# -*- coding: utf-8 -*-\n# Copyright (C) 2014...",
  "repo_name": "intermsofthewhole/libreboot",
  "path": "resources/utilities/i945gpu/intel-regs.py",
  "language": "Python",
  "license": "mit",
  "size": 3733
}
```
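
For downstream code it can help to pin the schema down in one place. The sketch below mirrors the six fields from the table as a `TypedDict` and adds a small checker. `NotaBugRecord` and `validate_record` are names invented here, and the `repo_name` check assumes the `username/repository_name` format holds for every record, which the card states but which has not been verified against the data.

```python
from typing import List, TypedDict


class NotaBugRecord(TypedDict):
    """Field names and types as listed in the Data Fields table."""
    code: str
    repo_name: str
    path: str
    language: str
    license: str
    size: int


EXPECTED_TYPES = {
    "code": str,
    "repo_name": str,
    "path": str,
    "language": str,
    "license": str,
    "size": int,
}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")

    # Assumption: repo_name always has exactly one "/" separating owner and name.
    repo = record.get("repo_name")
    if isinstance(repo, str):
        owner, sep, name = repo.partition("/")
        if not (owner and sep and name) or "/" in name:
            problems.append("repo_name: expected 'username/repository_name'")

    size = record.get("size")
    if isinstance(size, int) and size < 0:
        problems.append("size: must be non-negative")
    return problems


# The sample record from this card; its "code" value is truncated in the card itself.
sample = {
    "code": "#!/usr/bin/env python2\n# -*- coding: utf-8 -*-\n# Copyright (C) 2014...",
    "repo_name": "intermsofthewhole/libreboot",
    "path": "resources/utilities/i945gpu/intel-regs.py",
    "language": "Python",
    "license": "mit",
    "size": 3733,
}
print(validate_record(sample))  # an empty list: no problems found
```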
### File Format
- **Format**: Apache Parquet with Zstd compression
- **Structure**: 12 consolidated files (`notabug_0000.parquet` to `notabug_0011.parquet`)
- **Encoding**: UTF-8
- **Split**: All examples are included in a single training split (no validation or test splits)
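
Because the data ships as a single `train` split, one way to inspect it without fetching all 12 Parquet files is the streaming mode of the Hugging Face `datasets` library. The sketch below assumes the dataset is reachable under the repository id taken from the download URL above and that `datasets` is installed; `stream_records` and `summarize` are helper names made up for this example. Calling `stream_records(3)` and printing `summarize(rec)` for each result would show the first three records.

```python
from itertools import islice

# Repository id taken from the download URL above.
REPO_ID = "Ujjwal-Tyagi/notabug"


def stream_records(limit=5):
    """Yield the first `limit` records of the train split without a full download."""
    # Imported lazily so the rest of this module works without the dependency.
    from datasets import load_dataset  # third-party: pip install datasets

    dataset = load_dataset(REPO_ID, split="train", streaming=True)
    yield from islice(dataset, limit)


def summarize(record):
    """Format one record's metadata as a single line (the code itself is omitted)."""
    return (
        f'{record["repo_name"]}:{record["path"]} '
        f'[{record["language"]}, {record["license"]}, {record["size"]} B]'
    )
```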
---
## Data Creation Process
### Language Detection Methodology
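Programming languages and file types are inferred from the file extension. As the file-type table shows, common extensions are mapped to language names (for example C++, PHP, Python), less common ones appear as the bare extension (for example `po`, `dtsi`, `el`), and `none` forms its own category, presumably for files without an extension.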
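
A minimal sketch of extension-based inference follows. It is not the dataset's actual labeling code: the mapping `EXTENSION_LABELS` is a tiny hypothetical table, and the fallbacks (bare extension for unmapped types, `none` for files without an extension) are assumptions read off the file-type table. Labels that look filename-based, such as `Makefile`, are not handled here.

```python
from pathlib import PurePosixPath

# Hypothetical, deliberately small mapping; the real table is not part of this card.
EXTENSION_LABELS = {
    ".py": "Python",
    ".cpp": "C++",
    ".php": "PHP",
    ".js": "JavaScript",
    ".md": "Markdown",
}


def infer_file_type(path: str) -> str:
    """Guess a file-type label from the extension of a repository-relative path."""
    suffix = PurePosixPath(path).suffix.lower()
    if not suffix:
        # Assumption: extension-less files (and dotfiles) fall into "none".
        return "none"
    # Assumption: unmapped extensions are reported as the bare extension.
    return EXTENSION_LABELS.get(suffix, suffix.lstrip("."))


print(infer_file_type("resources/utilities/i945gpu/intel-regs.py"))
print(infer_file_type("locale/de.po"))
print(infer_file_type("bin/configure"))
```

With this mapping the three calls yield `Python`, `po`, and `none` respectively.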
### Source Data
All data originates from public repositories hosted on [NotaBug.org](https://notabug.org), a platform built on fully free and open-source software infrastructure.
### License Detection
License identification follows a systematic approach:
1. Scan repositories for license files and declarations
2. Match license text against known license patterns
3. Default to "unknown" if no license can be detected
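
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline that produced the dataset: the candidate file names, the regular expressions, and the returned identifiers other than `mit` and `unknown` are all hypothetical choices made for this example.

```python
import re

# Hypothetical list of file names that commonly hold license text.
LICENSE_FILENAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING")

# Hypothetical patterns; each matches a distinctive phrase from the license text.
LICENSE_PATTERNS = [
    ("agpl-3.0", re.compile(r"GNU AFFERO GENERAL PUBLIC LICENSE\s+Version 3", re.I)),
    ("gpl-3.0", re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 3", re.I)),
    ("gpl-2.0", re.compile(r"GNU GENERAL PUBLIC LICENSE\s+Version 2", re.I)),
    ("mpl-2.0", re.compile(r"Mozilla Public License,?\s+(?:Version|v\.)\s*2\.0", re.I)),
    ("mit", re.compile(r"Permission is hereby granted, free of charge", re.I)),
]


def detect_license(files: dict) -> str:
    """Detect a repository license from a mapping of path -> file text."""
    # Step 1: look for license files at the repository root.
    for name in LICENSE_FILENAMES:
        text = files.get(name)
        if text is None:
            continue
        # Step 2: match the text against known license patterns.
        for license_id, pattern in LICENSE_PATTERNS:
            if pattern.search(text):
                return license_id
    # Step 3: fall back to "unknown" when nothing matches.
    return "unknown"


repo = {"LICENSE": "MIT License\n\nPermission is hereby granted, free of charge, to any person"}
print(detect_license(repo))
```

Pattern matching of this kind is coarse: a repository with several licenses, or with license text only in file headers, would still come out as a single label or as `unknown`.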
### Quality Filtering
The only filter applied is a line-length limit; no deduplication is performed:
#### Line Length Constraints
- Files with any line exceeding 1,000 characters are excluded
- This ensures compatibility with text processing pipelines
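
The rule is easy to reproduce when adding new files or re-checking a subset. The sketch below assumes the limit is counted in characters excluding the line terminator; the card does not say how terminators were treated, and `passes_line_length_filter` is a name chosen for this example.

```python
MAX_LINE_LENGTH = 1_000  # limit stated in this section


def passes_line_length_filter(code: str, limit: int = MAX_LINE_LENGTH) -> bool:
    """Return True if no line in `code` is longer than `limit` characters."""
    # splitlines() drops line terminators, so they are not counted (an assumption).
    return all(len(line) <= limit for line in code.splitlines())


print(passes_line_length_filter("short line\n" + "x" * 1000))  # True: exactly at the limit
print(passes_line_length_filter("short line\n" + "x" * 1001))  # False: one line is too long
```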
#### Deduplication Policy
- No deduplication was performed on the dataset
- All files from public repositories are included as-is
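
Since forks and vendored copies are likely to repeat files, users who want a deduplicated view have to build it themselves. One simple option, shown here as a sketch and not as something the dataset authors applied, is exact-match deduplication on a hash of the `code` field. `iter_unique` is a hypothetical helper, and it keeps one digest per distinct file in memory.

```python
import hashlib


def iter_unique(records):
    """Yield records whose `code` content has not been seen before (exact match)."""
    seen = set()
    for record in records:
        digest = hashlib.sha256(record["code"].encode("utf-8")).digest()
        if digest in seen:
            continue  # identical content already yielded from another path or repo
        seen.add(digest)
        yield record


records = [
    {"repo_name": "alice/tool", "path": "main.py", "code": "print('hi')\n"},
    {"repo_name": "bob/tool-fork", "path": "main.py", "code": "print('hi')\n"},
    {"repo_name": "bob/tool-fork", "path": "util.py", "code": "X = 1\n"},
]
unique = list(iter_unique(records))
print(len(unique))  # 2: the forked copy of main.py is dropped
```

Exact hashing only removes byte-identical files; near-duplicates that differ in whitespace or a license header would need a fuzzier method.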
#### Content Inclusion
- UTF-8 encoded files are retained
- All other public repository content is preserved, subject to the line-length limit above
---
## Usage Considerations
### Data Privacy and Security
The dataset may contain sensitive information that requires careful handling:
- **Email Addresses**: Present in code comments, documentation, or configuration files
- **Credentials**: Accidentally committed API keys or authentication tokens
- **Personal Information**: Names, phone numbers, and other identifiable data in comments or documentation
Users should implement appropriate filtering and anonymization when preparing data for model training.
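
As a starting point, email addresses can be masked and a few well-known secret formats flagged with regular expressions. The sketch below is far from complete: the patterns are illustrative choices (a generic email pattern, the `AKIA` prefix format of AWS access key ids, and PEM private key headers), and `redact_emails` and `looks_like_secret` are names introduced here. A dedicated secret scanner would catch much more.

```python
import re

# Generic email pattern; intentionally simple and not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Illustrative secret patterns only; real scanners use far larger rule sets.
SECRET_RES = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key id format
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def redact_emails(text: str, placeholder: str = "<EMAIL>") -> str:
    """Replace every email-like substring with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def looks_like_secret(text: str) -> bool:
    """Return True if any of the illustrative secret patterns occurs in `text`."""
    return any(pattern.search(text) for pattern in SECRET_RES)


print(redact_emails("Contact: jane.doe@example.org for details"))
print(looks_like_secret("-----BEGIN RSA PRIVATE KEY-----"))
```

A reasonable policy is to redact emails in place but drop whole files that trip a secret pattern, since partial masking of credentials is easy to get wrong.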
### Licensing and Attribution
This dataset aggregates source code from repositories with diverse licenses. Any use of code or data derived from this dataset must comply with the original repository licenses, including attribution requirements where applicable.
The `license` field in each record indicates the license of the source repository. Users are responsible for:
- Reviewing applicable license terms
- Providing proper attribution when required
- Ensuring compliance with license restrictions
- Respecting the software freedom principles of the original projects
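
In practice, compliance often starts with restricting training data to an allow-list of licenses. The sketch below filters records on the `license` field. The allow-list is a hypothetical example, not legal advice, and apart from `mit` (seen in the sample record) the exact strings stored in the field are assumptions, so the distinct values should be inspected before relying on it.

```python
# Hypothetical allow-list; only "mit" is confirmed by the sample record.
ALLOWED_LICENSES = frozenset({"mit", "bsd-2-clause", "bsd-3-clause", "unlicense"})


def normalize_license(value: str) -> str:
    """Lowercase and trim a license string so comparisons are case-insensitive."""
    return value.strip().lower()


def keep_record(record: dict, allowed=ALLOWED_LICENSES) -> bool:
    """Return True if the record's license is on the allow-list."""
    return normalize_license(record["license"]) in allowed


records = [
    {"path": "a.py", "license": "mit"},
    {"path": "b.c", "license": "unknown"},
    {"path": "c.js", "license": "MIT "},
]
kept = [r["path"] for r in records if keep_record(r)]
print(kept)  # ['a.py', 'c.js']
```

Records labeled `unknown` are excluded by construction, which removes the 888,840 files whose license could not be detected.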
### Free Software Principles
Users of this dataset should be mindful of the free software values of the NotaBug.org community:
- Respect developer privacy and freedom
- Support open-source software development
- Consider contributing improvements back to the community
- Use the dataset ethically and transparently
---
## Technical Details
- **Source**: Public repositories hosted on [NotaBug.org](https://notabug.org)
- **Annotations**: Machine-generated (file type detection, license identification)
- **Language Detection**: File extension-based inference
- **Task Categories**: Text generation, code modeling, language understanding
- **Tags**: Code, free software, open-source, software freedom
- **Platform Stack**: NotaBug.org runs on free and open-source software (GNU/Linux and a modified version of Gogs, among others)
---
Provided by:
Ujjwal-Tyagi



