Replication package for the paper "What Do Infrastructure-as-Code Practitioners Discuss: An Empirical Study on Stack Overflow"
DataCite Commons. Updated 2023-07-08. Indexed 2024-08-18.
Download link:
https://figshare.com/articles/dataset/replication_package_for_the_paper_What_Do_Infrastructure-as-Code_Practitioners_Discuss_An_empirical_Study_on_Stack_Overflow_/22734890
Resource description:
<strong>Replication Package of the Empirical Study</strong> <br> Title: "<em>What Do Infrastructure-as-Code Practitioners Discuss: An Empirical Study on Stack Overflow</em>" <br> Authors: Mahi BEGOUG, Narjes Bessghaier, Ali Ouni, Eman Abdullah AlOmar, Mohamed Wiem Mkaouer. <br> This replication package includes the following folders: <br>
<strong>study_methodology</strong>: This folder covers the three methodology steps: extracting IaC tags, extracting and cleaning posts, and applying LDA topic modeling. It consists of three subfolders:<br>
a. <em>Extract_IaC_Tags</em>: This folder covers the extraction of IaC tags using relevance and significance metrics. The file "iac_tag_filtering.xlsx" in the data folder records the agreement on the selected tags.<br>
b. <em>Extract_Clean_IaC_Posts</em>: This folder covers the extraction of IaC-related posts from the "iac_dataset.csv" file and the cleaning that removes irrelevant information.<br>
c. <em>Apply_Topic_Modeling</em>: This folder covers the application of LDA topic modeling. The trained model is stored in the "saved_model" folder. Additionally, the "Adapt_Genetic_Algorithm_GA" folder contains the implementation of the Genetic Algorithm (GA) with LDA (see <em>ga_bootstrap.ipynb</em>). For LDA, we used the Mallet framework. For topic coherence, we used the Gensim framework, whose coherence model measures the quality of topics. We set the <em>coherence</em> parameter of the coherence model to 'c_v'. For the GA, we adapted the implementation used by CISO. <br>
<strong>RQ1 folder:</strong> This folder contains the script that measures the evolution of IaC questions and of the users involved in IaC discussions from 2011 to 2012, presenting the results for Research Question 1 (RQ1). <br>
<strong>RQ2 folder:</strong> This folder includes a data file named "RQ2_manual_analysis_30_random_samples.xlsx", which provides details about our labeling of IaC topics. The script "rq2.ipynb" measures the number of questions for each topic, presenting the results for Research Question 2 (RQ2). <br>
<strong>RQ3 folder:</strong> This folder contains the script that computes the difficulty and popularity metrics, presenting the results for Research Question 3 (RQ3). <br>
For any suggestions or improvements, please contact us at: mahi.begoug.1[at]ens.etsmtl.ca
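The RQ1 and RQ2 scripts are described above only at a high level. As a rough illustration of the kind of counting they describe (questions and distinct users per year, and questions per topic), here is a minimal Python sketch. The record layout, the field names (`created`, `user`, `topic`), the sample rows, and both function names are hypothetical and are not taken from the package.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical records: field names and values are assumptions for illustration,
# not the schema or the data of the replication package.
posts = [
    {"id": 1, "created": "2011-03-02", "user": "u1", "topic": "topic_a"},
    {"id": 2, "created": "2011-07-19", "user": "u2", "topic": "topic_b"},
    {"id": 3, "created": "2012-01-05", "user": "u1", "topic": "topic_a"},
    {"id": 4, "created": "2012-04-11", "user": "u1", "topic": "topic_a"},
    {"id": 5, "created": "2012-09-30", "user": "u3", "topic": "topic_b"},
]


def yearly_activity(records):
    """Return {year: (question_count, distinct_user_count)}, sorted by year."""
    questions = Counter()
    users = defaultdict(set)
    for record in records:
        year = datetime.strptime(record["created"], "%Y-%m-%d").year
        questions[year] += 1              # one more question asked that year
        users[year].add(record["user"])   # a set keeps each user once per year
    return {year: (questions[year], len(users[year])) for year in sorted(questions)}


def questions_per_topic(records):
    """Return a Counter mapping each topic label to its number of questions."""
    return Counter(record["topic"] for record in records)


if __name__ == "__main__":
    print(yearly_activity(posts))
    print(questions_per_topic(posts))
```

Counting distinct users with a set per year avoids double-counting a user who asks several questions in the same year. The actual notebooks may define both measures differently.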
Provider:
figshare
Created:
2023-07-08



