ronantakizawa/github-top-code

Hugging Face · Updated 2026-02-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ronantakizawa/github-top-code
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- code
tags:
- code
- github
- source-code
- trending-developers
- software-engineering
size_categories:
- 1M<n<10M
---

# GitHub Top Developer Source Code

A curated dataset of 1.3M+ source code files from **GitHub's top ranked developers (2015-2025)**.

This dataset is based on the top ranked developers from this dataset: https://huggingface.co/datasets/ronantakizawa/github-top-developers

## Dataset Summary

- **1.3M+ source code files** from repositories across ~4,700 unique developers
- **80+ programming languages** included (Python, JavaScript, TypeScript, Rust, Go, C/C++, Java, and more)
- **Source code only**: config files (JSON, YAML, TOML, etc.) and documentation (Markdown, TXT) are excluded
- **Permissive licenses only** (MIT, Apache-2.0, BSD, ISC, etc.)
- **Rich metadata** per file: repo stars, description, primary language, developer company affiliation

![Screenshot 2026-02-23 at 10.41.38 AM](https://cdn-uploads.huggingface.co/production/uploads/65a752167bfcb01564e6276c/WJVtEjZijsz8zW3KT0TU4.png)

## Schema

Each row represents a single source file:

| Column | Type | Description |
|--------|------|-------------|
| `file_path` | string | Path within the repo (e.g. `src/main.py`) |
| `file_language` | string | Language detected from file extension (e.g. `Python`, `JavaScript`) |
| `content` | string | Raw source code (UTF-8) |
| `repo_name` | string | Full repository name (`owner/repo`) |
| `repo_stars` | int64 | GitHub star count at time of collection |
| `repo_description` | string | Repository description |
| `repo_primary_language` | string | GitHub-detected primary language of the repository |
| `developer_username` | string | GitHub username |
| `developer_name` | string | Developer display name |
| `developer_company` | string | Company affiliation |

**Note on language columns:** `file_language` is determined per-file from the file extension (e.g. a `.py` file is always `Python`). `repo_primary_language` is GitHub's auto-detected primary language for the entire repository. These may differ: for example, a C header file (`.h` → `C/C++ Header`) in a repo that GitHub classifies as `Python`.

## Splits

| Split | Description |
|-------|-------------|
| `train` | ~90% of repos, for training |
| `test` | ~5% of repos, for evaluation |
| `validation` | ~5% of repos, for hyperparameter tuning |

Splits are assigned **by repository** (deterministic hash), so no repo appears in multiple splits. This prevents data leakage from files in the same project.

## Usage

```python
from datasets import load_dataset

# Load a specific split
train = load_dataset("ronantakizawa/github-top-code", split="train")
test = load_dataset("ronantakizawa/github-top-code", split="test")

# Filter by language
python_files = train.filter(lambda x: x["file_language"] == "Python")

# Filter by stars
popular = train.filter(lambda x: x["repo_stars"] > 1000)

# Get files from a specific developer
dev_files = train.filter(lambda x: x["developer_username"] == "torvalds")
```
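The repository-level split assignment described under Splits can be illustrated with a minimal sketch. The dataset card only says "deterministic hash", so the hash function (SHA-256), the bucket thresholds, the `assign_split` helper, and the sample rows below are all assumptions made for illustration, not the author's actual procedure.

```python
import hashlib


def assign_split(repo_name: str) -> str:
    """Map a repository name to a split via a stable hash (assumed scheme)."""
    # SHA-256 is an assumption; the dataset card does not name the hash used.
    digest = hashlib.sha256(repo_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    if bucket < 90:
        return "train"       # roughly 90% of repos
    if bucket < 95:
        return "test"        # roughly 5% of repos
    return "validation"      # roughly 5% of repos


# Hypothetical rows: every file inherits the split of its repository,
# so files from one project never end up on both sides of a split.
rows = [
    {"repo_name": "octocat/hello-world", "file_path": "src/main.py"},
    {"repo_name": "octocat/hello-world", "file_path": "src/util.py"},
    {"repo_name": "octocat/spoon-knife", "file_path": "index.js"},
]

repos_by_split = {}
for row in rows:
    split = assign_split(row["repo_name"])
    repos_by_split.setdefault(split, set()).add(row["repo_name"])
```

Because the split depends only on `repo_name`, re-running the assignment gives the same result every time, and adding new repositories does not move existing ones between splits.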
Provided by:
ronantakizawa