
Defect Prediction Tool Validation Dataset 1

Indexed from NIAID Data Ecosystem on 2026-03-12
Download link:
https://zenodo.org/record/3906022
Description:
The dataset provides IaC-oriented, delta, and process metrics extracted from open-source GitHub repositories based on the Ansible language, as well as the corresponding validated defect-prediction models.

Context

Infrastructure-as-Code (IaC) is the DevOps strategy of managing and provisioning infrastructure through machine-readable definition files and the automation around them, rather than through physical hardware configuration or interactive configuration tools. On the one hand, although IaC is increasingly widely adopted, little is known about how best to maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion. On the other hand, source code measurements are often computed and analyzed to evaluate different quality aspects of the software being developed. In particular, Infrastructure-as-Code is simply "code" and, as such, is as prone to defects as code written in any other programming language. This dataset targets the YAML-based Ansible language to devise **defect prediction** approaches for IaC based on machine learning.

Content

The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:

* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 11% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yml file;
* The repository has a comments ratio of at least 0.2%;
* The repository has a commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.023 events per month on average;
* The repository has evidence of a license, such as the presence of a LICENSE.md file;
* The repository has at least 190 source lines of code.

Metrics are grouped into three categories:

* IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts;
* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric;
* Process: metrics that capture aspects of the development process rather than aspects of the code itself.

A description of the process metrics in this dataset can be found [here](https://pydriller.readthedocs.io/en/latest/processmetrics.html).

In addition to the metrics, the dataset contains the pre-trained models (one per repository) in the folder models.zip. The folder contains the following two files for each repository:

* a pickle file containing the pre-trained model. It has the following naming convention: .pkl

  To load the model in Python, use the following:

  ```
  from joblib import load

  model = load('path/to/the/pickle/file.pkl', mmap_mode='r')
  ```

* a JSON file containing the attributes used to train the model. Those attributes are needed to use the corresponding model correctly. It has the following naming convention: .json

  To read the attributes after loading the model as described above, use:

  ```
  import json

  with open('path/to/the/pickle/file.json', 'r') as in_file:
      model_features = json.load(in_file)
  ```

You can then collect those features and pass them to the model for the final prediction.

Acknowledgements

Thanks to the open-source community :)

Inspiration

What source code properties and properties of the development process are good predictors of defects in Infrastructure-as-Code scripts?

Note

Further information about the dataset and the validation results can be found at: https://github.com/stefanodallapalma/Defect-Prediction-IaC-TSE2020-replication-package
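
The loading steps described under Content stop at "pass those features to the model". The sketch below shows one way that last step might look. It rests on several assumptions not confirmed by the dataset description: that the JSON file holds a flat list of feature names, that the unpickled model exposes a scikit-learn-style `predict` method accepting a list of rows, and that you have already computed a metric value for every listed feature. The helper names (`load_model_and_features`, `build_feature_row`, `predict_defect`) are hypothetical and not part of the dataset.

```python
import json
from typing import Any, Mapping, Sequence


def load_model_and_features(pkl_path: str, json_path: str):
    """Load one repository's model and its feature list (paths are placeholders)."""
    from joblib import load  # imported lazily; requires joblib to be installed

    model = load(pkl_path)
    with open(json_path, "r") as in_file:
        # Assumption: the JSON file is a flat list of feature names.
        model_features = json.load(in_file)
    return model, model_features


def build_feature_row(model_features: Sequence[str],
                      metrics: Mapping[str, float]) -> list:
    """Order metric values to match the feature list; fail loudly on gaps."""
    missing = [name for name in model_features if name not in metrics]
    if missing:
        raise KeyError(f"metrics missing for features: {missing}")
    # Metrics that are not in the feature list are ignored.
    return [metrics[name] for name in model_features]


def predict_defect(model: Any,
                   model_features: Sequence[str],
                   metrics: Mapping[str, float]):
    """Return the model's prediction for a single script's metrics."""
    row = build_feature_row(model_features, metrics)
    # Assumption: scikit-learn-style predict() that takes a list of rows.
    return model.predict([row])[0]
```

Keeping the row construction separate from the model call makes the feature ordering explicit and surfaces missing metrics as an error instead of a silently misaligned input. If a model was trained on a pandas DataFrame, it may expect named columns rather than a plain list, in which case the row would need to be wrapped accordingly.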
Created:
2020-12-01