
Manually categorized defect-related commits in Programming-Language Infrastructure-as-Code (PL-IaC) repositories

NIAID Data Ecosystem, indexed 2026-05-10
Download link:
https://data.mendeley.com/datasets/r53r882ygt
Description:
This dataset was produced as part of a replication study that examined whether a defect taxonomy originally developed for Infrastructure-as-Code (IaC) scripts also applies to Programming-Language-based Infrastructure-as-Code (PL-IaC) projects. The study investigated which defect categories appear in PL-IaC code written in general-purpose languages such as TypeScript, Python, and Go. It builds on the PIPr dataset, the first systematically curated dataset of open-source repositories that use Pulumi, AWS CDK, or Terraform CDK. Repositories were selected using four inclusion criteria: (i) at least 11% PL-IaC files, (ii) non-fork status, (iii) at least two commits per month, and (iv) at least ten contributors. From these repositories, we extracted the commit histories and applied a set of defect-related keywords to identify potentially relevant commits. This keyword filtering step produced 5,465 candidate commits.

Each candidate commit was then manually inspected in a structured three-phase process. First, we verified whether each commit was genuinely defect-related, which yielded 2,220 confirmed defect-related commits. The remaining 3,245 commits matched defect-related keywords but were determined not to represent defect fixes or introductions, and were labeled No Defect. Second, both authors independently performed descriptive coding of the defect-related commits to identify the nature and cause of each defect. Third, we reconciled the individual codings and found that the same eight defect categories identified in the original IaC study (Conditional, Configuration Data, Dependency, Documentation, Idempotency, Security, Service, and Syntax) also appear in the PL-IaC context. These categories are not mutually exclusive: a commit that exhibits several defect types carries multiple labels, separated by semicolons.
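The four inclusion criteria above can be read as a single predicate over per-repository statistics. The following is a minimal sketch under stated assumptions, not the authors' selection script: the `RepoStats` class and all of its field names are hypothetical, and the actual PIPr columns may be named and scaled differently.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    # Hypothetical field names; the real PIPr metadata may use other columns.
    pl_iac_file_ratio: float   # share of PL-IaC files, from 0.0 to 1.0
    is_fork: bool
    commits_per_month: float
    contributors: int


def meets_inclusion_criteria(repo: RepoStats) -> bool:
    """Apply the four criteria stated in the description."""
    return (
        repo.pl_iac_file_ratio >= 0.11      # (i) at least 11% PL-IaC files
        and not repo.is_fork                # (ii) non-fork status
        and repo.commits_per_month >= 2     # (iii) at least two commits per month
        and repo.contributors >= 10         # (iv) at least ten contributors
    )


# Made-up example values, for illustration only.
candidate = RepoStats(pl_iac_file_ratio=0.25, is_fork=False,
                      commits_per_month=3.0, contributors=12)
print(meets_inclusion_criteria(candidate))
```

All four conditions must hold at once, so a repository that fails any single criterion is excluded.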
The resulting dataset consists of two UTF-8 encoded CSV files:

* categorized-pl-iac-commits.csv: all 5,465 filtered commits, including their GitHub links, full commit text (metadata and diff), and one or more defect category labels.
* repositories.csv: metadata describing the 67 source repositories, including URLs, license information, timestamps, and repository characteristics.

The findings show that the defect taxonomy originally defined for IaC scripts is also valid for PL-IaC projects, indicating that both types of infrastructure code share similar defect patterns despite being implemented in different language paradigms. Researchers can use this dataset to replicate or extend taxonomy validation studies, train and evaluate machine learning models for defect detection, develop static analyzers and linters for PL-IaC, or conduct cross-language analyses of defect characteristics and automated repair techniques.
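Because a commit can carry several semicolon-separated labels, per-category counts need the label field split before tallying. The sketch below shows one way to do that with Python's standard library. The column names `commit_url` and `category`, and the three sample rows, are assumptions made up for illustration; check the header of the actual CSV before reusing this.

```python
import csv
import io
from collections import Counter


def split_labels(cell: str) -> list[str]:
    """Split a semicolon-separated label cell into trimmed labels."""
    return [part.strip() for part in cell.split(";") if part.strip()]


def count_labels(rows, label_column: str = "category") -> Counter:
    """Tally every label; a multi-label commit counts once per label."""
    counts = Counter()
    for row in rows:
        counts.update(split_labels(row[label_column]))
    return counts


# Made-up sample with hypothetical column names, not real dataset rows.
SAMPLE = """commit_url,category
https://example.com/commit/1,Syntax
https://example.com/commit/2,Configuration Data; Dependency
https://example.com/commit/3,No Defect
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = count_labels(rows)
for label, n in sorted(counts.items()):
    print(label, n)
```

To run this against the real file, replace the in-memory sample with `open("categorized-pl-iac-commits.csv", encoding="utf-8", newline="")`. Since the file stores full commit text including diffs, individual fields may be large, and raising the parser limit with `csv.field_size_limit` may be necessary. Note that the label totals will exceed the number of commits whenever multi-label commits are present.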
Created:
2025-11-06