Manually categorized defect-related commits in Programming-Language Infrastructure-as-Code (PL-IaC) repositories
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/r53r882ygt
Description:
This dataset was produced as part of a replication study that explored whether a defect taxonomy originally developed for Infrastructure-as-Code (IaC) scripts would also apply to Programming-Language-based Infrastructure-as-Code (PL-IaC) projects. The research aimed to investigate which defect categories would appear in PL-IaC code written in general-purpose languages such as TypeScript, Python, and Go.
To conduct this investigation, we used the PIPr dataset, the first systematically curated dataset of open-source repositories that employ Pulumi, AWS CDK, or Terraform CDK. We selected repositories using four inclusion criteria: (i) at least 11% PL-IaC files, (ii) non-fork status, (iii) at least two commits per month, and (iv) at least ten contributors. From these repositories, we extracted commit histories and applied a set of defect-related keywords to identify potentially relevant commits. This keyword filtering step produced 5,465 candidate commits.
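The four inclusion criteria above can be expressed as a single predicate. The following is only a minimal sketch: the `RepoStats` fields and the example repositories are hypothetical and do not reflect the actual PIPr or repositories.csv schema.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    # All field names are hypothetical; they are not the PIPr schema.
    name: str
    pl_iac_file_ratio: float  # fraction of files that are PL-IaC (0.0 to 1.0)
    is_fork: bool
    commits_per_month: float
    contributors: int


def meets_inclusion_criteria(repo: RepoStats) -> bool:
    """Apply the four inclusion criteria described in the text."""
    return (
        repo.pl_iac_file_ratio >= 0.11   # (i) at least 11% PL-IaC files
        and not repo.is_fork             # (ii) non-fork status
        and repo.commits_per_month >= 2  # (iii) at least two commits per month
        and repo.contributors >= 10      # (iv) at least ten contributors
    )


# Made-up repositories, for illustration only.
candidates = [
    RepoStats("example/infra-a", 0.40, False, 12.0, 25),
    RepoStats("example/infra-b", 0.05, False, 30.0, 40),  # too few PL-IaC files
    RepoStats("example/infra-c", 0.50, True, 8.0, 15),    # fork
    RepoStats("example/infra-d", 0.30, False, 1.0, 12),   # too inactive
]

selected = [r.name for r in candidates if meets_inclusion_criteria(r)]
print(selected)
```

Treating each threshold as inclusive matches the "at least" wording of the criteria.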
Each commit was then manually inspected in a structured three-phase process. First, we verified whether each commit was genuinely defect-related, which yielded 2,220 confirmed defect-related commits. The remaining 3,245 commits matched defect-related keywords but did not represent defect fixes or introductions, and were labeled as No Defect. Second, both authors independently performed descriptive coding of the defect-related commits to identify the nature and cause of each defect. Third, we reconciled our individual codings and found that the same eight defect categories identified in the original IaC study (Conditional, Configuration Data, Dependency, Documentation, Idempotency, Security, Service, and Syntax) also appeared in the PL-IaC context. These categories are not mutually exclusive: some commits exhibit multiple defect types, in which case the labels are separated by semicolons.
The resulting dataset consists of two UTF-8 encoded CSV files:
* categorized-pl-iac-commits.csv — all 5,465 filtered commits, including their GitHub links, full commit text (metadata and diff), and one or more defect category labels.
* repositories.csv — metadata describing the 67 source repositories, including URLs, license information, timestamps, and repository characteristics.
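Because a commit can carry several semicolon-separated labels, per-category counts require splitting the label field first. The sketch below assumes a label column named `category` and uses made-up in-memory rows; the real column names in categorized-pl-iac-commits.csv may differ and should be checked against the file header.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; "url" and "category" are hypothetical
# column names, not the confirmed header of the published CSV.
SAMPLE = """url,category
https://github.com/example/repo/commit/1,Configuration Data
https://github.com/example/repo/commit/2,Security;Dependency
https://github.com/example/repo/commit/3,No Defect
"""


def split_labels(cell: str) -> list[str]:
    """Split a semicolon-separated label cell into clean label names."""
    return [part.strip() for part in cell.split(";") if part.strip()]


def count_categories(rows, column: str = "category") -> Counter:
    """Count how often each label occurs across all rows."""
    counts: Counter = Counter()
    for row in rows:
        counts.update(split_labels(row[column]))
    return counts


counts = count_categories(csv.DictReader(io.StringIO(SAMPLE)))
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

To run this against the published file, replace the `io.StringIO(SAMPLE)` source with `open("categorized-pl-iac-commits.csv", newline="", encoding="utf-8")` and adjust `column` to the actual header name. Since the full commit text includes diffs, very large fields may also require raising the limit with `csv.field_size_limit`.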
The findings show that the defect taxonomy originally defined for IaC scripts is also valid for PL-IaC projects, indicating that both types of infrastructure code share similar defect patterns despite being implemented in different language paradigms.
Researchers can use this dataset to replicate or extend taxonomy validation studies, train and evaluate machine learning models for defect detection, develop static analyzers and linters for PL-IaC, or conduct cross-language analyses of defect characteristics and automated repair techniques.
Created:
2025-11-06



