
FurkanNar/PyHub

Source: Hugging Face | Updated: 2026-04-26 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/FurkanNar/PyHub
Resource overview:
---
license: mit
task_categories:
- text-generation
- text-classification
- question-answering
language:
- en
tags:
- code
size_categories:
- 10K<n<100K
---

# PyHub: Vetted Python Code from Popular GitHub Repositories

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/) [![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)

A large-scale dataset of Python source code, test files, and documentation scraped from high-quality GitHub repositories. It is designed for training code understanding and generation models, particularly for software engineering benchmarks such as SWE-bench.

## Dataset Statistics

- **Total files**: 271,995
- **Repositories**: 50+ (each with at least 50 stars)
- **File types**: Python source, test files, READMEs
- **Time period**: Repositories created before January 1, 2020
- **Size limit**: Maximum 100 MB per repository
- **License**: MIT

## Dataset Structure

Each row in `dataset.csv` represents a single file with the following columns:

| Column | Description |
|--------|-------------|
| `repo_name` | Repository name (e.g., "requests") |
| `repo_full_name` | Full repository name (e.g., "psf/requests") |
| `owner` | Repository owner (e.g., "psf") |
| `stars` | Star count |
| `license` | SPDX license identifier |
| `repo_description` | Repository description |
| `filepath` | Relative path within repository |
| `file_type` | "python", "test", or "readme" |
| `language` | "Python", "Markdown", "reStructuredText", or "" |
| `content` | Full file text |
| `size_bytes` | File size in bytes |
| `num_lines` | Number of lines |

## File Types

- **Python source** (`.py`): Production code files
- **Test files** (`*test*.py`): Unit tests and test suites
- **README files** (`README.*`): Documentation in Markdown, reStructuredText, or plain text

## Collection Methodology

The dataset was collected using a custom GitHub scraper with the following process:

1. **Repository selection**: GitHub API search for non-fork repositories with at least 50 stars, created before 2020-01-01
2. **Cloning**: Shallow git clone (`--depth 1`) with a 100 MB size filter to exclude large monorepos
3. **File collection**: Recursive walk through cloned repositories, excluding hidden files and directories (names starting with `.`)
4. **File type filtering**: Only Python source files (`.py`), test files (`*test*.py`), and README files (`README.*`) were collected
5. **Content extraction**: Files read as UTF-8, with error handling for robust text extraction
6. **Parallel processing**: 3 concurrent workers
7. **CSV generation**: All file data consolidated into a single CSV, with repository metadata embedded in each row

## Quality Filters

- **Star threshold**: Minimum 50 stars (a proxy for community vetting)
- **Size limit**: 100 MB, to exclude monorepos and binary-heavy projects
- **File type filtering**: Only Python, test, and documentation files
- **Hidden files excluded**: Files and directories starting with `.` are ignored
- **Encoding handling**: UTF-8 with error fallback

## Intended Use Cases

- **Code completion**: Training autocompletion models on real-world Python patterns
- **Bug detection**: Learning from production codebases with established testing practices
- **Test generation**: Understanding test-code relationships from the included test files
- **Documentation generation**: Learning code-documentation correlations from READMEs
- **SWE-bench training**: Base dataset for software engineering benchmark preparation
- **Code understanding**: Learning repository structure and dependencies

## Limitations

- **Temporal bias**: Only repositories created before 2020 are included, so newer Python features may be underrepresented, such as `match` statements (structural pattern matching, Python 3.10) and newer type-hint syntax
- **Popularity bias**: Only high-star repositories are included, so niche or edge-case patterns may not be represented
- **Size limitation**: The 100 MB cap excludes large enterprise monorepos
- **Language bias**: Documentation and comments are primarily in English
- **Static only**: No execution data, test results, or runtime behavior

## Recommended Supplements

For comprehensive model training, consider supplementing with:

- Post-2020 repositories for modern Python patterns
- Smaller repositories for edge-case and niche patterns
- Synthetic examples for specific bug types
- Negative examples (buggy code) for robustness

## License

This dataset is licensed under the MIT License. See the LICENSE file for details.

## Contact

Email furkannar168@hotmail.com, or open an issue for any problems or questions.

---

**Note**: This dataset was created using a custom GitHub scraper tool.
Provided by:
FurkanNar