Manually categorized defect-related commits in Programming-Language Infrastructure-as-Code (PL-IaC) repositories
Source: NIAID Data Ecosystem, indexed 2026-05-10
Download link:
https://data.mendeley.com/datasets/r53r882ygt
Description:
This dataset was produced as part of a replication study that explored whether a defect taxonomy originally developed for Infrastructure-as-Code (IaC) scripts would also apply to Programming-Language-based Infrastructure-as-Code (PL-IaC) projects. The research aimed to investigate which defect categories would appear in PL-IaC code written in general-purpose languages such as TypeScript, Python, and Go.
To conduct this investigation, we used the PIPr dataset, the first systematically curated dataset that gathers open-source repositories employing Pulumi, AWS CDK, or Terraform CDK. For our purpose, repositories were selected using four inclusion criteria: (i) at least 11% PL-IaC files, (ii) non-fork status, (iii) at least two commits per month, and (iv) at least ten contributors. From these repositories, we extracted commit histories and applied a set of defect-related keywords to identify potentially relevant commits. This keyword filtering step produced 5,465 candidate commits.
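The selection and filtering steps above can be sketched roughly as follows. This is only an illustration: the field names on `Repo` are hypothetical and may not match the PIPr columns, and the keyword list is a placeholder, since the description does not state which defect-related keywords were actually used.

```python
import re
from dataclasses import dataclass


@dataclass
class Repo:
    # Hypothetical field names; the real PIPr columns may differ.
    pl_iac_file_ratio: float  # fraction of files that are PL-IaC (0.0 to 1.0)
    is_fork: bool
    commits_per_month: float
    contributors: int


def meets_inclusion_criteria(repo: Repo) -> bool:
    """Apply the four inclusion criteria listed in the description."""
    return (
        repo.pl_iac_file_ratio >= 0.11    # (i) at least 11% PL-IaC files
        and not repo.is_fork              # (ii) non-fork status
        and repo.commits_per_month >= 2   # (iii) at least two commits per month
        and repo.contributors >= 10       # (iv) at least ten contributors
    )


# Placeholder keywords for illustration only, NOT the study's actual list.
EXAMPLE_KEYWORDS = ("bug", "defect", "error", "fix")
_KEYWORD_RE = re.compile(r"\b(" + "|".join(EXAMPLE_KEYWORDS) + r")", re.IGNORECASE)


def is_candidate_commit(message: str) -> bool:
    """Return True if the commit message contains a word starting with a keyword."""
    return _KEYWORD_RE.search(message) is not None
```

A keyword match only marks a commit as a candidate. As the description notes, more than half of the candidates turned out not to be defect-related on manual inspection.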
Each commit was then manually inspected in a structured three-phase process. First, we verified whether each commit was genuinely defect-related, which yielded 2,220 confirmed defect-related commits. The remaining 3,245 commits, despite matching defect-related keywords, were determined not to represent defect fixes or introductions and were labeled as No Defect. Second, both authors independently performed descriptive coding of the defect-related commits to identify the nature and cause of each defect. Third, we reconciled our individual codings and found that the same eight defect categories identified in the original IaC study (Conditional, Configuration Data, Dependency, Documentation, Idempotency, Security, Service, and Syntax) also appeared in the PL-IaC context. These categories are not mutually exclusive: a commit that exhibits several defect types carries multiple labels, separated by semicolons.
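Because a single commit can carry several semicolon-separated labels, per-category counts require splitting each label cell first. A minimal sketch, assuming the labels are spelled exactly as in the description above (the spelling used in the CSV itself is not confirmed here):

```python
from collections import Counter

# Category names as given in the description; exact CSV spelling is an assumption.
CATEGORIES = {
    "Conditional", "Configuration Data", "Dependency", "Documentation",
    "Idempotency", "Security", "Service", "Syntax",
}
NO_DEFECT = "No Defect"

# Sanity check on the reported counts: confirmed defects plus No Defect
# commits should add up to the number of keyword-filtered candidates.
assert 2_220 + 3_245 == 5_465


def split_labels(cell: str) -> list[str]:
    """Split a semicolon-separated label cell into trimmed, non-empty labels."""
    return [part.strip() for part in cell.split(";") if part.strip()]


def count_categories(cells) -> Counter:
    """Count label occurrences; a multi-label commit counts once per label."""
    counts = Counter()
    for cell in cells:
        counts.update(split_labels(cell))
    return counts


example_cells = ["Security; Configuration Data", "No Defect", "Syntax", "Security"]
example_counts = count_categories(example_cells)
```

Note that the per-category counts can sum to more than the number of defect-related commits, since a multi-label commit contributes to each of its categories.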
The resulting dataset consists of two UTF-8 encoded CSV files:
* categorized-pl-iac-commits.csv: all 5,465 filtered commits, including their GitHub links, full commit text (metadata and diff), and one or more labels (defect categories, or No Defect).
* repositories.csv: metadata describing the 67 source repositories, including URLs, license information, timestamps, and repository characteristics.
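Reading the two files with Python's standard `csv` module might look like the sketch below. The column names in the demo (`url`, `text`, `label`) are hypothetical stand-ins, since the description does not list the real headers. Because each row holds a full commit diff, the sketch also raises the `csv` module's default field size limit, which long diffs could otherwise exceed.

```python
import csv
import io

# Full commit diffs can be very long; raise the csv module's per-field limit.
# 2**31 - 1 is used rather than sys.maxsize to stay within a C long on all platforms.
csv.field_size_limit(2**31 - 1)


def open_dataset_csv(path):
    """Open one of the dataset files as UTF-8; newline='' lets csv handle
    line breaks embedded in quoted fields (for example, inside diffs)."""
    return open(path, encoding="utf-8", newline="")


def read_rows(text_stream):
    """Read every row into a dict keyed by the header names found in the file."""
    return list(csv.DictReader(text_stream))


# In-memory demo with hypothetical column names and a multi-line quoted field.
demo = io.StringIO(
    'url,text,label\n'
    'https://example.invalid/commit/1,"line one\nline two",Syntax\n'
)
demo_rows = read_rows(demo)
```

With the real files, the same helpers would be used as `with open_dataset_csv("categorized-pl-iac-commits.csv") as f: rows = read_rows(f)`, after checking the actual header names in `rows[0].keys()`.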
The findings show that the defect taxonomy originally defined for IaC scripts is also valid for PL-IaC projects, indicating that both types of infrastructure code share similar defect patterns despite being implemented in different language paradigms.
Researchers can use this dataset to replicate or extend taxonomy validation studies, train and evaluate machine learning models for defect detection, develop static analyzers and linters for PL-IaC, or conduct cross-language analyses of defect characteristics and automated repair techniques.
Created:
2025-11-06



