github-code-2025
Source: ModelScope community. Updated 2026-01-08; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/nick007x/github-code-2025
Resource description:
# 🚀 GitHub Code 2025: The Clean Code Manifesto
> **A meticulously curated dataset of 1.5M+ repositories representing both quality and innovation in 2025's code ecosystem**
## 🌟 The Philosophy
**Quality Over Quantity, Purpose Over Volume**
In an era of data abundance, we present a dataset built on radical curation. Every file, every repository, every byte has been carefully selected to represent the **signal** in the noise of open-source development.
## 🎯 What This Dataset Is
### 📊 Dual-Perspective Design
| Subset | 🎖️ Above 2 Stars | 🌱 Below 2 Stars (2025) |
|--------|------------------|------------------------|
| **Scope** | 1M top repositories | 1M random 2025 repos |
| **Purpose** | Proven quality & patterns | Emerging trends & innovation |
| **Value** | What works | What's next |
### 🧹 The Clean Code Promise
```text
# What you WON'T find here:
🚫 Binary files # No images, executables, models
🚫 Build artifacts # No node_modules, __pycache__
🚫 Configuration noise # No .git, IDE files, lock files
🚫 License duplication # No repetitive legal text
🚫 Minified code # No compressed/obfuscated content
🚫 Empty files # No whitespace-only content
```
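The exclusions above could be approximated with a path-and-content filter like the sketch below. The directory names, lock-file names, extension list, and the minification threshold are illustrative assumptions; this card does not publish the lists the dataset authors actually used.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion lists; the dataset's real lists are not given in this card.
EXCLUDED_DIRS = {"node_modules", "__pycache__", ".git", ".idea", ".vscode"}
EXCLUDED_NAMES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock"}
BINARY_EXTENSIONS = {".png", ".jpg", ".exe", ".dll", ".so", ".bin", ".pt", ".onnx"}
MAX_AVG_LINE_LENGTH = 500  # assumed heuristic threshold for minified code


def is_excluded_path(file_path: str) -> bool:
    """Return True if the path falls under one of the excluded categories."""
    path = PurePosixPath(file_path)
    # Any parent directory on the exclusion list rejects the file.
    if any(part in EXCLUDED_DIRS for part in path.parts[:-1]):
        return True
    if path.name in EXCLUDED_NAMES:
        return True
    return path.suffix.lower() in BINARY_EXTENSIONS


def looks_minified(content: str) -> bool:
    """Heuristic: a very long average line length suggests minified code."""
    lines = content.splitlines()
    if not lines:
        return False
    return sum(len(line) for line in lines) / len(lines) > MAX_AVG_LINE_LENGTH
```

A path-based check is cheap and can run before any file is read; the line-length heuristic only needs the decoded text.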
## 📁 Dataset Structure
```
github-code-2025/
├── 📈 above-2-stars/
│   ├── train_000.parquet
│   ├── train_001.parquet
│   └── ...
└── 🌱 below-2-star/
    ├── train_000.parquet
    ├── train_001.parquet
    └── ...
```
### 📊 Schema
```python
{
    "repo_id": "owner/repo_name",    # 📍 Repository identifier
    "file_path": "src/main.py",      # 🗂️ Relative file path
    "content": "def clean_code():",  # 💎 Actual source code
    "size": 1024                     # 📏 File size in bytes
}
```
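To make the schema concrete, here is a minimal sketch of a typed record and a sanity check for it. The names `CodeRecord` and `validate_record` are hypothetical helpers, not part of the dataset or any library, and the `owner/repo_name` check assumes every `repo_id` contains exactly one slash.

```python
from typing import TypedDict


class CodeRecord(TypedDict):
    """One row of the dataset, mirroring the four documented fields."""
    repo_id: str
    file_path: str
    content: str
    size: int


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record looks valid."""
    problems = []
    expected = {"repo_id": str, "file_path": str, "content": str, "size": int}
    for field, field_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], field_type):
            problems.append(f"wrong type for {field}")
    # Assumption: repo_id always has the form owner/repo_name.
    if not problems and record["repo_id"].count("/") != 1:
        problems.append("repo_id is not in owner/repo_name form")
    return problems
```

Running `validate_record` on a handful of rows after loading is a quick way to confirm that the loaded columns match what this card describes.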
## 🛠️ How to Use
### 🔥 Quick Start
```python
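from datasets import interleave_datasets, load_dataset

# Load the quality benchmark (split="train" returns a Dataset rather than a DatasetDict)
quality_ds = load_dataset("nick007x/github-code-2025", "above-2-stars", split="train")
# Load emerging trends
emerging_ds = load_dataset("nick007x/github-code-2025", "below-2-star", split="train")
# Mix for balanced training; interleave_datasets expects Dataset objects, not DatasetDicts
balanced_ds = interleave_datasets([quality_ds, emerging_ds])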
```
### 🎯 Ideal Use Cases
- **🧠 AI Training**: Clean, diverse code for language models
- **📊 Code Analysis**: Compare popular vs emerging patterns
- **🔍 Trend Research**: 2025 development practices
- **🎓 Education**: High-quality examples for learning
- **🛠️ Tool Development**: Benchmarking code quality tools
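As one example of the code-analysis use case, the sketch below compares how file extensions are distributed across the two subsets. It relies only on the `file_path` field from the schema; the function names are hypothetical, and the toy records in the usage lines stand in for real dataset rows.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Mapping


def extension_counts(records: Iterable[Mapping]) -> Counter:
    """Count files per lowercase extension, using the `file_path` field."""
    counts = Counter()
    for record in records:
        suffix = PurePosixPath(record["file_path"]).suffix.lower()
        counts[suffix or "<none>"] += 1
    return counts


def share_difference(popular: Counter, emerging: Counter) -> dict:
    """Per-extension share in `emerging` minus its share in `popular`."""
    total_popular = sum(popular.values()) or 1
    total_emerging = sum(emerging.values()) or 1
    extensions = set(popular) | set(emerging)
    return {
        ext: emerging[ext] / total_emerging - popular[ext] / total_popular
        for ext in extensions
    }


# Toy records standing in for rows from the two subsets.
popular = extension_counts([{"file_path": "a.py"}, {"file_path": "b.js"}])
emerging = extension_counts([{"file_path": "a.rs"}, {"file_path": "b.py"}, {"file_path": "Makefile"}])
diff = share_difference(popular, emerging)
```

A positive value in `diff` means the extension makes up a larger share of the emerging subset than of the popular one. Because the functions accept any iterable of mappings, they should also work on rows yielded by a loaded or streamed dataset.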
## 🏗️ Creation Methodology
### 🎨 Selection Strategy
| Phase | Action | Purpose |
|-------|--------|---------|
| **1** | 🎯 Dual population sampling | Balance quality & innovation |
| **2** | 🧹 Multi-layer filtering | Remove noise & binaries |
| **3** | 📏 Size normalization | Focus on meaningful content |
| **4** | 🔍 Content validation | Ensure text quality |
| **5** | 🏷️ Metadata preservation | Maintain context |
### 🚫 What We Filtered Out
**File Types Removed:**
- 50+ binary extensions (images, models, executables)
- 30+ build/system directories
- 15+ configuration file types
- All files outside the 1 KB to 5 MB size range
**Quality Checks:**
- ✅ UTF-8 text validation
- ✅ Non-empty content check
- ✅ Binary detection
- ✅ Repository structure preservation
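The size and quality checks listed above can be sketched as a single predicate over a file's raw bytes. This is an illustration under stated assumptions, not the authors' pipeline: the size bounds are treated as inclusive with 1 KB = 1024 bytes, and binary detection is approximated by looking for NUL bytes.

```python
MIN_SIZE_BYTES = 1024             # 1 KB lower bound (assumed inclusive)
MAX_SIZE_BYTES = 5 * 1024 * 1024  # 5 MB upper bound (assumed inclusive)


def passes_quality_checks(raw: bytes) -> bool:
    """Apply size, binary, UTF-8, and non-empty checks to a file's raw bytes."""
    if not MIN_SIZE_BYTES <= len(raw) <= MAX_SIZE_BYTES:
        return False
    if b"\x00" in raw:  # assumed heuristic: NUL bytes indicate binary content
        return False
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Reject files that contain only whitespace.
    return bool(text.strip())
```

Checking size first avoids decoding very large files, and the NUL-byte test catches binary payloads that happen to be valid UTF-8.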
## 🎪 Why This Dataset Matters
### 💫 The Quality Revolution
We reject the "more data is better" dogma. Instead, we offer:
- **🎯 Intentional Curation**: Every file serves a purpose
- **⚖️ Balanced Perspective**: Popular + Emerging = Complete picture
- **🧹 Unprecedented Cleanliness**: Binaries, build artifacts, configuration noise, minified code, and empty files are filtered out
- **📅 Temporal Intelligence**: 2025-focused for relevance
## 🤝 Contributing & Feedback
This dataset is a living project. We welcome:
- 🐛 Bug reports and issues
- 💡 Feature requests for future versions
- 📊 Validation of data quality
- 🎯 Suggestions for improvement
## 📜 License
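This dataset aggregates GitHub repos. Each individual repo maintains its original copyright and license terms (for code, these are typically open-source software licenses such as MIT, Apache-2.0, or GPL; some repos use Creative Commons licenses such as CC BY or CC BY-NC, and some carry no license at all).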
Users must verify and comply with the specific license of any repo they extract and use from this collection.
The MIT license in this repository applies only to the dataset compilation and packaging code.
**Important**: Repository contents maintain their original licenses. Please respect individual project licenses when using this data.
## 🙏 Acknowledgments
Built with gratitude for the entire open-source community. Every file in this dataset represents hours of dedication from developers worldwide.
---
**⭐ If this dataset helps your research or project, please consider starring the repository!**
> **"In the pursuit of AI that understands code, we must first understand what code is worth learning."**
Provider:
maas
Created:
2025-10-17



