GoBug Defect Dataset

Name: GoBug Defect Dataset
Creator: Emre Can Yılmaz
License: Not specified

Indexed from IEEE on 2026-04-17

Download link:

https://ieee-dataport.org/documents/gobug-defect-dataset

Description:

To address the growing need for empirical resources in modern software ecosystems, we present GoBug, the first large-scale, multi-level dataset for software defect prediction (SDP) in the Go programming language (Go). The dataset was constructed from 16 prominent, high-impact open-source Go projects from GitHub, including Kubernetes, Terraform, and InfluxDB.

GoBug provides labeled data at three distinct levels of granularity: commit, file, and method. A key feature of this dataset is its rich and comprehensive set of metrics, which combines traditional process metrics (e.g., code churn, author count) and static code metrics (e.g., cyclomatic complexity, LOC) with a novel suite of Go-specific metrics. These language-aware features, such as goroutine_count, channel_count, and error_handling_count, were accurately extracted by performing static analysis on the Abstract Syntax Tree (AST) of each source file.

The ground truth for labeling was established using a high-confidence, reproducible methodology. We first identified bug-fixing commits by leveraging human-curated "bug" labels from GitHub Pull Requests. Subsequently, we applied a refactoring-aware version of the Sliwerski-Zimmermann-Zeller (SZZ) algorithm to pinpoint the corresponding bug-introducing commits.

This dataset is intended to serve as a foundational, standardized benchmark to enable and accelerate reproducible research into software quality, testing, and defect prediction within the Go ecosystem.

Provider:

Emre Can Yılmaz
