Replication Package: "Insights into Security-Related AI-Generated Pull Requests"
Source: Figshare. Updated: 2025-10-22. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Replication_Package_Insights_into_Security-Related_AI-Generated_Pull_Requests_/30421996
Description:
This replication package contains all datasets, scripts, and documentation used in our empirical study. The package is organized into five main folders, corresponding to the major stages of the study: dataset construction and the analyses for RQ1–RQ4.

1. Dataset_construction
This folder contains scripts for constructing the dataset.

find_ai_prs_with_100_stars.py
Filters PRs from repositories with at least 100 GitHub stars.
Output: ai_pull_requests_over_100stars.csv

applying_security_keywords.py
Expands the filtered dataset by applying a comprehensive list of security-related keywords.
Output: ai_prs_security_candidates_expanded.csv

applying_gemini_to_get_final_dataset.py
Uses Gemini-based model validation to check whether each candidate PR is security-related. It saves the results to an intermediate file named ai_prs_all_classified_gemini.xlsx, then uses that file to select the PRs labeled as security 'yes' and stores them in final_dataset.csv.
Output: final_dataset.csv

annotator_security_prs_sample_245.xlsx
Contains a manually annotated subset (n=245) of PRs used for validation of model predictions.

dataset_agreement_analysis.py
Computes inter-annotator agreement metrics (Cohen's κ) for the manual sample.

2. RQ1_analysis
This folder contains scripts used to answer RQ1.
Input: final_dataset.csv

run_semgrep.py
Runs Semgrep across all PR code changes.
Output: all_prs_with_semgrep.csv

RQ1_analysis.py
Aggregates and analyzes vulnerability types to address RQ1.

3. RQ2_analysis
This folder contains the analysis for RQ2.

Subfolder: feature_extraction/
Input: final_dataset.csv

find_factors.py
Extracts PR- and repository-level features.
Output: ai_factors.csv

Subfolder: regression_analysis/
Input: ai_factors.csv

PR_latency.R, PR_acceptance.R, and common.R
Perform regression analyses.

4. RQ3_analysis
This folder contains scripts for RQ3.

find_commits.py
Extracts all commits associated with PRs listed in final_dataset.csv.
Output: ai_security_prs_with_commits.csv

C-Good.py
Replicates the pretrained commit message quality model (C-Good) proposed by Tian et al.

preprocessor_step1.py, preprocessor_step2.py, preprocessor_step3.py
Sequentially preprocess the original messages.csv dataset (from Tian et al.) for model training.
Output: trained model bert_commit_model.pth

preprocessing_my_dataset.py
Applies the same preprocessing pipeline to ai_security_prs_with_commits.csv.
Output: ai_security_prs_with_commits_preprocessed.csv

testing_my_dataset.py
Loads bert_commit_model.pth, evaluates commits, and produces commit-level quality labels.
Output: ai_security_prs_with_commits_predictions.csv

commit_message_sample_339.csv
Contains a random sample of 339 commit messages manually reviewed to verify the accuracy of the model predictions.

manual_verification.py
Analyzes commit_message_sample_339.csv by comparing manual ratings with model-predicted labels.

rq3_analysis.py
Performs analysis of commit message quality.

5. RQ4_analysis
This folder contains scripts and resources for RQ4.

find_rejected_prs_comments.py
Identifies PRs closed without merging from final_dataset.csv and collects maintainer review comments.
Output: rejected_pr_comments.csv

Rubrics.docx
Defines categories and guidelines for manual annotation.

Manual Annotation
Two annotators manually review rejected_pr_comments.csv following the rubric.
Output: manual_labeling.xlsx

rq4_analysis.py
Analyzes the annotated dataset (manual_labeling.xlsx).
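
For readers unfamiliar with the agreement metric named under Dataset_construction, the following is a minimal sketch of how Cohen's κ can be computed for two annotators. It is not the code in dataset_agreement_analysis.py, which may well rely on a library instead; the function name cohens_kappa and the toy labels are hypothetical and are not taken from annotator_security_prs_sample_245.xlsx.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Illustrative helper only (hypothetical name, not from the package).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, NOT taken from the replication package.
annotator_1 = ["yes", "yes", "no", "no", "yes", "no"]
annotator_2 = ["yes", "no", "no", "no", "yes", "yes"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 4 of 6 items, while chance agreement is 0.5, so κ is well below the raw agreement rate. That gap is the reason κ is usually reported instead of plain percent agreement.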
Created:
2025-10-22



