NASA MDP datasets, GitHub datasets

Name: NASA MDP datasets, GitHub datasets
Creator: Kent Business School, University of Kent, UK
Published: 2019-01-07 17:48:33
License: No description available

arXiv | Updated 2019-01-07 | Indexed 2024-06-21

Download link:

http://openscience.us/repo/defect/

Overview:

This study uses two dataset suites, NASA MDP and GitHub, comprising 27 individual datasets in total, to evaluate software defect prediction performance. The NASA MDP suite contains defect data from multiple NASA projects, while the GitHub suite is collected from GitHub code repositories and covers a wide range of software project types. The datasets are used to train and test different classification models that predict software defects. With them, the study examines how different evaluation metrics and test procedures affect prediction accuracy, and proposes an improved benchmark configuration.

Provided by:

Kent Business School, University of Kent, UK

Created:

2019-01-07

Summary

Dataset Introduction

Construction

The study used the NASA MDP and GitHub datasets to examine the class imbalance problem, applying oversampling and undersampling techniques to improve the predictive accuracy for faulty software units. The NASA data come from the MDP project, which holds defect datasets from various NASA artifacts; the GitHub data come from GitHub repositories, which host a large number of open source software projects. To address data quality issues, preprocessing removed duplicate records and inconsistencies. The study also applied sampling techniques such as ADASYN to balance the class distribution.
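The two preprocessing ideas above, duplicate removal and class rebalancing, can be sketched in a few lines of standard-library Python. This is an illustration only: it uses plain random oversampling rather than ADASYN, and the function names, the column layout, and the toy rows are assumptions, not taken from the datasets.

```python
import random

def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def random_oversample(rows, label_index=-1, seed=0):
    """Re-draw minority-class rows at random until every class has equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_index], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Add randomly chosen copies to bring this class up to the target size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical toy rows: (lines_of_code, cyclomatic_complexity, defective)
modules = [
    (120, 7, 0), (120, 7, 0), (45, 2, 0), (300, 15, 1),
    (80, 4, 0), (60, 3, 0), (210, 11, 1), (95, 5, 0),
]
clean = drop_duplicates(modules)
balanced = random_oversample(clean)
```

Deduplicating before resampling matters here, because oversampling deliberately creates repeated rows that a later duplicate filter would remove again. Resampling should also be applied only to training data, so that test folds keep the original class distribution.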

Features

The NASA MDP and GitHub datasets are established resources for software defect prediction. The proportion of faulty units varies across datasets, which makes them suitable for studying the class imbalance problem. They cover a diverse range of software projects, including control software for observers and open source projects from GitHub, so researchers can compare defect prediction across different software domains. The datasets are documented with the number of observations, variables, and fault observations, which simplifies use and analysis. They also support evaluating predictive accuracy with metrics such as AUC and the H measure, so the performance of different classifiers can be compared and the most effective ones identified.
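As a reminder of what AUC measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen defective unit receives a higher score than a randomly chosen clean one, with ties counted as one half. The function name and the example labels and scores are illustrative assumptions; the H measure is not implemented here.

```python
def auc(labels, scores):
    """AUC from its pairwise definition; labels are 1 (defective) or 0 (clean)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # defective unit ranked above clean unit
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for six software units.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
value = auc(labels, scores)
```

The double loop is quadratic in the number of units, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large datasets.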

Usage

The NASA MDP and GitHub datasets can be used in a step-by-step workflow. First, download the datasets from their respective repositories. Next, preprocess the data to address class imbalance and data quality issues; this may involve oversampling and undersampling, as well as removing duplicate records and inconsistencies. Then split each dataset into training and testing sets with a cross-validation scheme, such as five-fold cross-validation, so that predictive accuracy can be evaluated with metrics like AUC and the H measure. Build prediction models with a range of classifiers, including Bayesian, tree-based, support vector machine, neural network, and boosting approaches, and tune their parameters to improve performance. Finally, assess the classifiers with statistical tests, such as the Friedman test, post-hoc tests, and Bayesian tests, to determine whether the performance differences between them are significant. This workflow helps researchers select the classifiers best suited to their needs.
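Two of the steps above, the five-fold split and the Friedman test, can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the function names are hypothetical, the AUC table is invented for illustration, and the Friedman statistic shown is the basic chi-square form without tie correction or p-value computation.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)   # deal indices out round-robin
    return folds

def average_ranks(row):
    """Rank the scores in a row, 1 = highest; tied scores share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and row[order[end + 1]] == row[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for j in order[start:end + 1]:
            ranks[j] = mean_rank
        start = end + 1
    return ranks

def friedman_statistic(results):
    """Friedman chi-square for a table of n datasets (rows) by k classifiers."""
    n, k = len(results), len(results[0])
    rank_rows = [average_ranks(row) for row in results]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return chi2, mean_ranks

# 10 defective and 40 clean units, split into five folds.
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels, k=5)

# Hypothetical AUC values: 4 datasets (rows) x 3 classifiers (columns).
auc_table = [
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.71],
    [0.79, 0.77, 0.73],
    [0.85, 0.80, 0.74],
]
chi2, mean_ranks = friedman_statistic(auc_table)
```

Each fold serves once as the test set while the remaining four are used for training. The Friedman statistic only indicates whether some difference between classifiers exists; identifying which pairs differ requires the post-hoc or Bayesian tests mentioned above.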

Background and Challenges

Background Overview

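In software engineering, accurately predicting software defects is key to improving software quality and efficiency. The NASA MDP datasets, contributed by multiple NASA projects, are an important benchmark for software defect prediction research. They contain historical defect data from a variety of NASA projects and give researchers a rich source of experimental data. With the rise of open source platforms such as GitHub, researchers have also begun to use datasets built from GitHub software projects to supplement the traditional datasets. Together, these datasets offer new perspectives and additional data for software defect prediction research.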

Current Challenges

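Software defect prediction research faces several challenges. First, defect datasets may have data quality problems, such as missing values and duplicate records. Second, they may be class-imbalanced, with far fewer samples in the defective class than in the non-defective class. Third, the evaluation metrics and methods for defect prediction models remain contested, for example the suitability of the ROC curve and AUC. Finally, there is a trade-off between the complexity and the interpretability of defect prediction models.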

Common Scenarios

Classic Use Cases

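In software development, accurately predicting defective code is a key aspect of software analytics. The datasets are commonly used to evaluate software defect prediction performance, with machine learning models detecting defective code. The classic use case is building a classification model that predicts whether a software unit is likely to be defective, so that developers and managers can prioritize those units and improve software quality.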

Derived Related Work

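The datasets have given rise to much well-known work on software defect prediction, using Bayesian methods, tree-based classifiers, support vector machines, neural networks, and boosting methods. In addition, the accompanying study puts forward the H-measure as an evaluation metric to address potential limitations of AUC, and provides a new Bayesian testing procedure for a deeper understanding of software defect prediction performance.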

Recent Research on the Dataset