Replication Package: "Insights into Security-Related AI-Generated Pull Requests"

Name: Replication Package: "Insights into Security-Related AI-Generated Pull Requests"
Creator: figshare
Published: 2025-12-08 21:29:00
License: No description available

DataCite Commons. Updated: 2025-12-08. Indexed: 2026-04-25.

Download link:

https://figshare.com/articles/dataset/Replication_Package_Insights_into_Security-Related_AI-Generated_Pull_Requests_/30421996

Description:

This replication package contains all datasets, scripts, and documentation used in our empirical study. The package is organized into five main folders, corresponding to the major stages of the study: dataset construction and the analyses for RQ1–RQ4.

1. Dataset_construction
This folder contains scripts for constructing the dataset.
- find_ai_prs_with_100_stars.py: Filters PRs from repositories with at least 100 GitHub stars. Output: <code>ai_pull_requests_over_100stars.csv</code>
- applying_security_keywords.py: Expands the filtered dataset by applying a comprehensive list of security-related keywords. Output: <code>ai_prs_security_candidates_expanded.csv</code>
- applying_gemini_to_get_final_dataset.py: Uses Gemini-based model validation to check whether each candidate PR is security-related. It saves the results to an intermediate file named <code>ai_prs_all_classified_gemini.xlsx</code>, and then uses that file to select the PRs labeled as security ‘yes’ and store them in <code>final_dataset.csv</code>. Output: <code>final_dataset.csv</code>
- annotator_security_prs_sample_245.xlsx: Contains a manually annotated subset (n=245) of PRs used for validation of model predictions.
- dataset_agreement_analysis.py: Computes inter-annotator agreement metrics (Cohen’s κ) for the manual sample.

2. RQ1_analysis
This folder contains scripts used to answer RQ1. Input: <code>final_dataset.csv</code>
- run_semgrep.py: Runs Semgrep across all PR code changes. Output: <code>all_prs_with_semgrep.csv</code>
- RQ1_analysis.py: Aggregates and analyzes vulnerability types to address RQ1.

3. RQ2_analysis
This folder analyzes RQ2.
- Subfolder feature_extraction/ (input: <code>final_dataset.csv</code>)
  - find_factors.py: Extracts PR- and repository-level features. Output: <code>ai_factors.csv</code>
- Subfolder regression_analysis/ (input: <code>ai_factors.csv</code>)
  - PR_latency.R, PR_acceptance.R, and common.R: Perform regression analyses.

4. RQ3_analysis
This folder contains scripts for RQ3.
- find_commits.py: Extracts all commits associated with PRs listed in <code>final_dataset.csv</code>. Output: <code>ai_security_prs_with_commits.csv</code>
- C-Good.py: Replicates the pretrained commit message quality model (C-Good) proposed by Tian et al.
- preprocessor_step1.py, preprocessor_step2.py, preprocessor_step3.py: Sequentially preprocess the original <code>messages.csv</code> dataset (from Tian et al.) for model training. Output: trained model <code>bert_commit_model.pth</code>
- preprocessing_my_dataset.py: Applies the same preprocessing pipeline to <code>ai_security_prs_with_commits.csv</code>. Output: <code>ai_security_prs_with_commits_preprocessed.csv</code>
- testing_my_dataset.py: Loads <code>bert_commit_model.pth</code>, evaluates commits, and produces commit-level quality labels. Output: <code>ai_security_prs_with_commits_predictions.csv</code>
- commit_message_sample_339.csv: Contains a random sample of 339 commit messages manually reviewed to verify the accuracy of the model predictions.
- manual_verification.py: Analyzes <code>commit_message_sample_339.csv</code> by comparing manual ratings with model-predicted labels.
- rq3_analysis.py: Performs analysis of commit message quality.

5. RQ4_analysis
This folder contains scripts and resources for RQ4.
- find_rejected_prs_comments.py: Identifies PRs closed without merging from <code>final_dataset.csv</code> and collects maintainer review comments. Output: <code>rejected_pr_comments.csv</code>
- Rubrics.docx: Defines categories and guidelines for manual annotation.
- Manual Annotation: Two annotators manually review <code>rejected_pr_comments.csv</code> following the rubric. Output: <code>manual_labeling.xlsx</code>
- rq4_analysis.py: Analyzes the annotated dataset (<code>manual_labeling.xlsx</code>).
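The description mentions that dataset_agreement_analysis.py computes Cohen’s κ for the manually annotated sample. As a rough illustration of what that metric measures, the sketch below computes κ for two annotators from scratch. It is not the authors' script: the function name `cohens_kappa` and the example label lists are hypothetical, and the package's actual file layout and column names are not assumed here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Hypothetical helper for illustration; not taken from the package.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same number of items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("At least one labelled item is required")

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    # If both annotators used a single identical label, agreement is trivial.
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four PRs (assumption: binary yes/no security labels).
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```

In this toy example the annotators agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, so κ comes out at 0.50. The same calculation would apply to any pair of label columns, such as the two annotators' labels in the RQ4 manual annotation step.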

Provider:

figshare

Created:

2025-10-22
