Replication Package: "Insights into Security-Related AI-Generated Pull Requests"

Source: Figshare. Updated: 2025-10-22. Indexed: 2026-04-28.

Download link:

https://figshare.com/articles/dataset/Replication_Package_Insights_into_Security-Related_AI-Generated_Pull_Requests_/30421996

Description:

This replication package contains all datasets, scripts, and documentation used in our empirical study. The package is organized into five main folders, corresponding to the major stages of the study: dataset construction and the analyses for RQ1–RQ4.

1. Dataset_construction
This folder contains scripts for constructing the dataset.

find_ai_prs_with_100_stars.py
Filters PRs from repositories with at least 100 GitHub stars.
Output: ai_pull_requests_over_100stars.csv

applying_security_keywords.py
Expands the filtered dataset by applying a comprehensive list of security-related keywords.
Output: ai_prs_security_candidates_expanded.csv

applying_gemini_to_get_final_dataset.py
Uses Gemini-based model validation to check whether each candidate PR is security-related. It saves the results to an intermediate file named ai_prs_all_classified_gemini.xlsx, and then uses that file to select the PRs labeled as security ‘yes’ and store them in final_dataset.csv.
Output: final_dataset.csv

annotator_security_prs_sample_245.xlsx
Contains a manually annotated subset (n=245) of PRs used for validation of model predictions.

dataset_agreement_analysis.py
Computes inter-annotator agreement metrics (Cohen’s κ) for the manual sample.

2. RQ1_analysis
This folder contains scripts used to answer RQ1.
Input: final_dataset.csv

run_semgrep.py
Runs Semgrep across all PR code changes.
Output: all_prs_with_semgrep.csv

RQ1_analysis.py
Aggregates and analyzes vulnerability types to address RQ1.

3. RQ2_analysis
This folder analyzes RQ2.

Subfolder: feature_extraction/
Input: final_dataset.csv

find_factors.py
Extracts PR- and repository-level features.
Output: ai_factors.csv

Subfolder: regression_analysis/
Input: ai_factors.csv

PR_latency.R, PR_acceptance.R, and common.R
Perform regression analyses.

4. RQ3_analysis
This folder contains scripts for RQ3.

find_commits.py
Extracts all commits associated with PRs listed in final_dataset.csv.
Output: ai_security_prs_with_commits.csv

C-Good.py
Replicates the pretrained commit message quality model (C-Good) proposed by Tian et al.

preprocessor_step1.py, preprocessor_step2.py, preprocessor_step3.py
Sequentially preprocess the original messages.csv dataset (from Tian et al.) for model training.
Output: trained model bert_commit_model.pth

preprocessing_my_dataset.py
Applies the same preprocessing pipeline to ai_security_prs_with_commits.csv.
Output: ai_security_prs_with_commits_preprocessed.csv

testing_my_dataset.py
Loads bert_commit_model.pth, evaluates commits, and produces commit-level quality labels.
Output: ai_security_prs_with_commits_predictions.csv

commit_message_sample_339.csv
Contains a random sample of 339 commit messages manually reviewed to verify the accuracy of the model predictions.

manual_verification.py
Analyzes commit_message_sample_339.csv by comparing manual ratings with model-predicted labels.

rq3_analysis.py
Performs analysis of commit message quality.

5. RQ4_analysis
This folder contains scripts and resources for RQ4.

find_rejected_prs_comments.py
Identifies PRs closed without merging from final_dataset.csv.
Collects maintainer review comments.
Output: rejected_pr_comments.csv

Rubrics.docx
Defines categories and guidelines for manual annotation.

Manual Annotation
Two annotators manually review rejected_pr_comments.csv following the rubric.
Output: manual_labeling.xlsx

rq4_analysis.py
Analyzes the annotated dataset (manual_labeling.xlsx).
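
The Dataset_construction folder reports inter-annotator agreement as Cohen's κ. As a rough illustration of what that metric measures, the sketch below computes κ for two annotators from first principles. It is not the code in dataset_agreement_analysis.py, whose contents are not shown here; the function name cohens_kappa and the toy labels are hypothetical and chosen only for this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label
    # frequencies, summed over every label either annotator used.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (not taken from the replication package).
annotator_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
annotator_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this toy case the annotators agree on 6 of 8 items (observed agreement 0.75) while chance agreement is 0.5, so κ is 0.5. The same function could in principle be applied to two label columns read from the annotated spreadsheet, though the column names in that file are not documented here.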

Created:

2025-10-22
