Replication Package: "Insights into Security-Related AI-Generated Pull Requests"

Source: DataCite Commons. Updated: 2025-12-08. Indexed: 2026-04-25.
Download link:
https://figshare.com/articles/dataset/Replication_Package_Insights_into_Security-Related_AI-Generated_Pull_Requests_/30421996
Description:
This replication package contains all datasets, scripts, and documentation used in our empirical study.<br>
The package is organized into five main folders, corresponding to the major stages of the study: dataset construction and the analyses for RQ1–RQ4.<br>
<br>
<b>1. Dataset_construction</b><br>
This folder contains scripts for constructing the dataset.<br>
<b>find_ai_prs_with_100_stars.py</b><br>
Filters PRs from repositories with at least 100 GitHub stars.<br>
<b>Output:</b> <code>ai_pull_requests_over_100stars.csv</code><br>
<b>applying_security_keywords.py</b><br>
Expands the filtered dataset by applying a comprehensive list of security-related keywords.<br>
<b>Output:</b> <code>ai_prs_security_candidates_expanded.csv</code><br>
<b>applying_gemini_to_get_final_dataset.py</b><br>
Uses Gemini-based model validation to check whether each candidate PR is security-related. It saves the results to an intermediate file named <code>ai_prs_all_classified_gemini.xlsx</code>, then uses that file to select the PRs labeled as security ‘yes’ and stores them in <code>final_dataset.csv</code>.<br>
<b>Output:</b> <code>final_dataset.csv</code><br>
<b>annotator_security_prs_sample_245.xlsx</b><br>
Contains a manually annotated subset (n=245) of PRs used for validation of model predictions.<br>
<b>dataset_agreement_analysis.py</b><br>
Computes inter-annotator agreement metrics (Cohen’s κ) for the manual sample.<br>
<br>
<b>2. RQ1_analysis</b><br>
This folder contains scripts used to answer <b>RQ1.</b><br>
<b>Input:</b> <code>final_dataset.csv</code><br>
<b>run_semgrep.py</b><br>
Runs Semgrep across all PR code changes.<br>
<b>Output:</b> <code>all_prs_with_semgrep.csv</code><br>
<b>RQ1_analysis.py</b><br>
Aggregates and analyzes vulnerability types to address RQ1.<br>
<br>
<b>3. RQ2_analysis</b><br>
This folder analyzes <b>RQ2.</b><br>
<b>Subfolder: feature_extraction/</b><br>
<b>Input:</b> <code>final_dataset.csv</code><br>
<b>find_factors.py</b><br>
Extracts PR- and repository-level features.<br>
<b>Output:</b> <code>ai_factors.csv</code><br>
<b>Subfolder: regression_analysis/</b><br>
<b>Input:</b> <code>ai_factors.csv</code><br>
<b>PR_latency.R</b>, <b>PR_acceptance.R</b>, and <b>common.R</b><br>
Perform regression analyses.<br>
<br>
<b>4. RQ3_analysis</b><br>
This folder contains scripts for <b>RQ3.</b><br>
<b>find_commits.py</b><br>
Extracts all commits associated with PRs listed in <code>final_dataset.csv</code>.<br>
<b>Output:</b> <code>ai_security_prs_with_commits.csv</code><br>
<b>C-Good.py</b><br>
Replicates the pretrained commit message quality model (C-Good) proposed by Tian et al.<br>
<b>preprocessor_step1.py</b>, <b>preprocessor_step2.py</b>, <b>preprocessor_step3.py</b><br>
Sequentially preprocess the original <code>messages.csv</code> dataset (from Tian et al.) for model training.<br>
<b>Output:</b> trained model <code>bert_commit_model.pth</code><br>
<b>preprocessing_my_dataset.py</b><br>
Applies the same preprocessing pipeline to <code>ai_security_prs_with_commits.csv</code>.<br>
<b>Output:</b> <code>ai_security_prs_with_commits_preprocessed.csv</code><br>
<b>testing_my_dataset.py</b><br>
Loads <code>bert_commit_model.pth</code>, evaluates commits, and produces commit-level quality labels.<br>
<b>Output:</b> <code>ai_security_prs_with_commits_predictions.csv</code><br>
<b>commit_message_sample_339.csv</b><br>
Contains a random sample of 339 commit messages manually reviewed to verify the accuracy of the model predictions.<br>
<b>manual_verification.py</b><br>
Analyzes <code>commit_message_sample_339.csv</code> by comparing manual ratings with model-predicted labels.<br>
<b>rq3_analysis.py</b><br>
Performs analysis of commit message quality.<br>
<br>
<b>5. RQ4_analysis</b><br>
This folder contains scripts and resources for <b>RQ4.</b><br>
<b>find_rejected_prs_comments.py</b><br>
Identifies PRs closed without merging from <code>final_dataset.csv</code> and collects maintainer review comments.<br>
<b>Output:</b> <code>rejected_pr_comments.csv</code><br>
<b>Rubrics.docx</b><br>
Defines categories and guidelines for manual annotation.<br>
<b>Manual Annotation</b><br>
Two annotators manually review <code>rejected_pr_comments.csv</code> following the rubric.<br>
<b>Output:</b> <code>manual_labeling.xlsx</code><br>
<b>rq4_analysis.py</b><br>
Analyzes the annotated dataset (<code>manual_labeling.xlsx</code>).<br>
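The description mentions that inter-annotator agreement is reported as Cohen’s κ. For readers unfamiliar with the metric, the following is a minimal sketch of how κ can be computed for two annotators. It is not the code in <code>dataset_agreement_analysis.py</code>; the function name <code>cohens_kappa</code>, the label values, and the toy inputs are assumptions made for illustration only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' marginal proportions, summed over all categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    # Degenerate case: both annotators used one identical label throughout.
    if p_e == 1.0:
        return 1.0

    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical toy labels, not taken from the replication package.
    annotator_1 = ["yes", "yes", "no", "no"]
    annotator_2 = ["yes", "no", "no", "no"]
    print(cohens_kappa(annotator_1, annotator_2))
```

In practice the two label columns would be read from the annotated spreadsheet; the column names in that file are not stated in the description, so they are left out here.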
Provider:
figshare
Created:
2025-10-22