
helloadhavan/github_issues

Hugging Face | Updated 2026-04-11 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/helloadhavan/github_issues
Resource description:
---
license: mit
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: eval
    path: data/eval-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: fix_commit
    dtype: string
  - name: buggy_commit
    dtype: string
  - name: message
    dtype: string
  - name: files
    list:
    - name: path
      dtype: string
    - name: patch
      dtype: string
    - name: additions
      dtype: int64
    - name: deletions
      dtype: int64
    - name: language
      dtype: string
  - name: timestamp
    dtype: timestamp[s]
  splits:
  - name: train
    num_bytes: 1561640639
    num_examples: 115096
  - name: eval
    num_bytes: 29054081
    num_examples: 3000
  - name: test
    num_bytes: 29054081
    num_examples: 3000
  download_size: 549629363
  dataset_size: 1619748801
task_categories:
- text-generation
- summarization
language:
- en
tags:
- code
pretty_name: Github issues dataset
size_categories:
- 100K<n<1M
---

# GitHub Pull Request Bug–Fix Dataset

[Kaggle URL](https://www.kaggle.com/datasets/adhavanyuvaraj/github-issues)

A **curated, high-signal dataset of real-world software bugs and fixes** collected from **25 popular open-source GitHub repositories**. Each entry corresponds to a **single bug-fix commit pair** and combines contextual metadata with the **exact code changes (unified diffs)** that fixed the bug.

This dataset is designed for:

- **Automated program repair**
- **Bug-fix patch generation**
- **LLM-based code and debugging agents**
- **Empirical software engineering research**

---

## How to use

Install the `datasets` Python library:

```bash
pip install datasets
```

Here is a copy-paste example:

```python
from datasets import load_dataset

# Load all splits (train, eval, test)
dataset = load_dataset("helloadhavan/github_issues")
print(dataset)

# Pick the first example from the train split
example = dataset["train"][0]

# Inspect a single example
print("Repository:", example["repo"])
print("Buggy commit:", example["buggy_commit"])
print("Fix commit:", example["fix_commit"])
print("Message:", example["message"])
print("Timestamp:", example["timestamp"])

print("\nModified files:")
for f in example["files"]:
    print("-", f["path"], f["language"])


# Filter examples by programming language
def contains_assembly_file(example):
    """Return True if any modified file in the example is Assembly."""
    return any(f["language"] == "Assembly" for f in example["files"])


assembly_fixes = dataset["train"].filter(contains_assembly_file)
print("Assembly-related fixes:", len(assembly_fixes))
```

## Data collection methodology

Data was collected from **GitHub repositories** by identifying commit pairs that represent a **bug-introducing version** and its corresponding **fix commit**.

The dataset was constructed and post-processed to ensure high signal and usability:

- Only commits representing **bug fixes or correctness changes** were included
- Each example explicitly links a **buggy commit** to the corresponding **fix commit**
- Repository metadata is preserved for traceability
- Code changes are stored as **unified diffs at the file level**
- Commits that only perform refactoring, formatting, or non-functional changes were excluded
- Entries without meaningful code changes were filtered out

Each dataset row represents **one bug-fix commit pair**, rather than a pull request.

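The last filtering rule above can be illustrated with a small predicate. This is a sketch under stated assumptions, not the author's scraper or post-processing code: the name `has_meaningful_change` and the sample rows are hypothetical, and "meaningful" is approximated here as "at least one file with a non-empty patch and a non-zero count of changed lines".

```python
from typing import Any


def has_meaningful_change(row: dict[str, Any]) -> bool:
    """Return True if at least one file has a non-empty patch
    and at least one added or removed line (an assumed criterion)."""
    for f in row.get("files", []):
        patch = f.get("patch") or ""
        changed_lines = (f.get("additions") or 0) + (f.get("deletions") or 0)
        if patch.strip() and changed_lines > 0:
            return True
    return False


# Hypothetical rows shaped like the dataset schema (values are made up).
rows = [
    {"repo": "owner/a", "files": [
        {"path": "a.py", "patch": "@@ -1 +1 @@\n-x = 1\n+x = 2",
         "additions": 1, "deletions": 1, "language": "Python"}]},
    {"repo": "owner/b", "files": [
        {"path": "b.md", "patch": "",
         "additions": 0, "deletions": 0, "language": "Markdown"}]},
    {"repo": "owner/c", "files": []},
]

kept = [r for r in rows if has_meaningful_change(r)]
print([r["repo"] for r in kept])
```

A check like this cannot tell a refactoring from a real fix; it only drops rows that carry no diff content at all.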
---

## Dataset schema

Each entry in the dataset follows the schema below:

```json
{
  "repo": "owner/repository",
  "buggy_commit": "abcdef123456...",
  "fix_commit": "fedcba654321...",
  "message": "Commit message describing the fix",
  "timestamp": "YYYY-MM-DDTHH:MM:SSZ",
  "files": [
    {
      "path": "path/to/file.ext",
      "patch": "unified diff representing the fix",
      "additions": 10,
      "deletions": 2,
      "language": "Programming language inferred from file extension"
    }
  ]
}
```

| Field               | Description                                           |
| ------------------- | ----------------------------------------------------- |
| `repo`              | GitHub repository containing the fix                  |
| `buggy_commit`      | Commit introducing or containing the bug              |
| `fix_commit`        | Commit that fixes the bug                             |
| `message`           | Commit message associated with the fix                |
| `timestamp`         | Timestamp of the fix commit (ISO 8601 format)         |
| `files`             | List of files modified by the fix                     |
| `files[].path`      | Path to the modified file                             |
| `files[].patch`     | Unified diff containing the code changes              |
| `files[].additions` | Number of lines added                                 |
| `files[].deletions` | Number of lines removed                               |
| `files[].language`  | Programming language inferred from the file extension |

## Supported languages

The dataset contains fixes across multiple programming languages, including (but not limited to):

* JavaScript / TypeScript
* C / C++
* Python
* Rust
* Go
* Java
* Objective-C / Objective-C++ (rare)
* Assembly (very rare: only 638 samples)

Language distribution varies by repository.

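To see how the language mix differs between repositories, one option is to tally `files[].language` per `repo`. The sketch below is illustrative only: the helper name `language_counts_by_repo` and the sample rows are hypothetical, and it runs on plain dictionaries shaped like the schema above so that it needs no download.

```python
from collections import Counter, defaultdict


def language_counts_by_repo(rows):
    """Count modified files per language, grouped by repository."""
    by_repo = defaultdict(Counter)
    for row in rows:
        for f in row["files"]:
            by_repo[row["repo"]][f["language"]] += 1
    return by_repo


# Hypothetical rows (only the fields used here are filled in).
rows = [
    {"repo": "owner/a", "files": [
        {"path": "src/x.ts", "language": "TypeScript"},
        {"path": "src/y.ts", "language": "TypeScript"}]},
    {"repo": "owner/a", "files": [
        {"path": "tools/z.py", "language": "Python"}]},
    {"repo": "owner/b", "files": [
        {"path": "lib.rs", "language": "Rust"}]},
]

by_repo = language_counts_by_repo(rows)
for repo, counts in by_repo.items():
    print(repo, counts.most_common())
```

Since the usage example iterates `example["files"]` as a list of dictionaries, the same helper should also accept a loaded split such as `dataset["train"]`, although a full pass over 115,096 rows will take noticeably longer.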
## Intended use cases

This dataset is well-suited for:

* Training models to generate patches from real pull request context
* Studying bug-fix patterns across large codebases
* Building autonomous debugging or repair agents
* Research in program repair, code synthesis, and software maintenance

It is not intended for:

* Pull request classification or triage
* Sentiment analysis

## Limitations

* The dataset reflects real-world noise from GitHub pull requests
* Buggy commit identification is heuristic and may be imperfect
* Some fixes involve refactoring or design changes rather than minimal patches
* No guarantee that fixes represent optimal or best-practice solutions

<blockquote style="
  background: #fff7cc;
  border-left: 5px solid #ffad00;
  padding: 12px 16px;
  color: #5c4b00;
  font-style: italic;
  border-radius: 4px;
">
<strong style="color:rgba(57, 0, 0, 1)">Note:</strong> Due to a bug in the scraper code, 121k samples were collected instead of the planned 50k.
</blockquote>
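The 121k figure in the note can be cross-checked against the split sizes in the card's metadata. A small sketch of that arithmetic, with the split sizes copied from the metadata and the 50k target taken from the note (the variable names are only for illustration):

```python
# Split sizes as listed under `splits` in the dataset card metadata.
split_sizes = {"train": 115_096, "eval": 3_000, "test": 3_000}
planned_total = 50_000  # planned size stated in the note

total = sum(split_sizes.values())
for name, n in split_sizes.items():
    print(f"{name}: {n} rows ({n / total:.1%} of all rows)")

print("total rows:", total)  # 121096, i.e. roughly 121k
print("ratio to planned size:", round(total / planned_total, 2))  # 2.42
```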
Provider:
helloadhavan
Collected summary
Dataset introduction
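Construction
In software engineering, high-quality bug-fix data is essential to research on automated program repair. This dataset systematically collects real-world software bugs and their fixes from 25 well-known open-source GitHub repositories, and its construction centers on identifying the commit that represents the bug-introducing version together with the corresponding fix commit. To keep the data high-signal and usable, only commits involving bug fixes or correctness changes are included, and each buggy commit is explicitly linked to its fix commit. Post-processing filtered out entries that contain only refactoring, formatting, or non-functional changes and removed records without substantive code modifications. Each row therefore corresponds to one bug-fix commit pair rather than a full pull request.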
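Features
The dataset's core feature is its curated collection of real-world bug-fix pairs. Each entry contains repository metadata, the hashes of the buggy and fix commits, the commit message, a timestamp, and file-level unified diff patches. For every modified file it records the path, the patch content, the number of added and deleted lines, and the programming language inferred from the file extension. The dataset covers multiple programming languages, including JavaScript, Python, and C++, and the language distribution varies by repository, which provides material for cross-language software maintenance research. It contains 115,096 training samples plus 3,000 each for evaluation and testing (about 121k in total) and is suited to model training and empirical analysis.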
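Usage
For academic and engineering use, the dataset can be loaded directly with the Hugging Face `datasets` library. After installing the library, users can access the train, eval, and test splits. A typical workflow is to load the dataset, inspect the structure of a sample, and filter on a condition such as programming language. For example, checking the language field of each file selects the fixes that touch a particular language, which supports targeted model training or pattern analysis. The dataset mainly serves automated program repair, bug-fix patch generation, LLM-based code and debugging agents, and empirical software engineering research. It is not intended for tasks such as pull request classification or sentiment analysis.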
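Background and challenges
Background overview
In software engineering, automated program repair and code defect analysis are core research directions for improving software quality and development efficiency. The GitHub Issues dataset, published by helloadhavan, collects real-world software bugs and fixes from 25 popular open-source GitHub repositories. After filtering, each entry corresponds to one bug-fix commit pair and links the bug-introducing version to the code diff of the fix commit. Its central research question is how large-scale historical bug data can drive tasks such as automated program repair, patch generation, and LLM-based code debugging agents. It is intended as a high-signal data foundation for empirical software engineering research and for the development of intelligent software maintenance tools.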
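Current challenges
The dataset targets a core challenge in automated program repair: accurately identifying defect patterns in complex code changes and generating effective fix patches. Its construction faced several difficulties. First, separating bug-fix commits from non-functional changes across a large volume of GitHub commits relies on heuristics, which can introduce noise. Second, data collection has to handle code diffs in many programming languages and heterogeneous repository metadata while keeping the data consistent and traceable. Third, the dataset unexpectedly grew beyond its planned size (about 121k samples instead of the planned 50k), which increases the sample count but may also affect balance and representativeness. Together these points show how hard it is to build a high-quality defect dataset from real software engineering environments.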
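Common scenarios
Classic use cases
At the intersection of software engineering and artificial intelligence, the GitHub Issues dataset is a resource for automated program repair research. Its collection of real-world bugs paired with fixing code lets researchers train models to learn the mapping from buggy code to a correct patch. Classic use cases include deep-learning patch generation systems that use the unified diffs in the dataset to model how developers decide on a fix, in support of intelligent debugging tools.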
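As a concrete illustration of that mapping, a row can be turned into a prompt/target pair for patch generation. This is a minimal sketch under assumptions: the helper name `to_training_pair`, the prompt layout, and the sample row are all hypothetical, and a real training setup would likely also need the buggy source itself, which the schema does not store.

```python
def to_training_pair(row):
    """Turn one bug-fix row into a (prompt, target) pair of strings.

    The prompt layout is an arbitrary choice made for illustration.
    """
    paths = "\n".join(f"- {f['path']}" for f in row["files"])
    prompt = (
        f"Repository: {row['repo']}\n"
        f"Buggy commit: {row['buggy_commit']}\n"
        f"Fix description: {row['message']}\n"
        f"Files to change:\n{paths}\n"
    )
    # The target is the concatenation of the per-file unified diffs.
    target = "\n".join(f["patch"] for f in row["files"])
    return prompt, target


# Hypothetical row shaped like the dataset schema (values are made up).
row = {
    "repo": "owner/repository",
    "buggy_commit": "abcdef123456",
    "fix_commit": "fedcba654321",
    "message": "Fix off-by-one error in range check",
    "files": [
        {"path": "src/check.py",
         "patch": "@@ -1 +1 @@\n-if i <= n:\n+if i < n:",
         "additions": 1, "deletions": 1, "language": "Python"},
    ],
}

prompt, target = to_training_pair(row)
print(prompt)
print(target)
```

Including the fix commit message in the prompt gives the model a description of the intended change. Dropping it yields a harder task that relies on the repository and file paths alone.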
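Academic problems addressed
The dataset addresses the long-standing scarcity of training data in automated program repair. By providing large-scale bug-fix pairs, it lets researchers analyze how software defects evolve and how they are repaired. It can serve as a reproducible experimental baseline for empirical software engineering research and is meant to help code generation models generalize to real-world scenarios and to support research on automating software maintenance.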
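Derived and related work
Work described as deriving from this dataset includes studies adapting neural machine translation architectures to code patch generation and the development of Transformer-based defect localization models. According to this summary, several studies have used the dataset to build benchmark suites such as an APR (Automated Program Repair) evaluation framework, contributing to papers in journals such as IEEE Transactions on Software Engineering. The summary does not cite specific papers.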
The content above was collected and summarized by 遇见数据集.