Code review regression analysis of open source GitHub projects

NIAID Data Ecosystem, indexed 2026-03-10

Download link:

http://datadryad.org/dataset/doi%253A10.6078%252FD14X0T


Description:

This dataset contains the repository data used for our study "A Large-Scale Study of Modern Code Review and Security in Open Source Projects". The data was collected from GitHub and includes 3,126 projects in 143 languages, with 489,038 issues and 382,771 pull requests. We also include the regression analysis notebooks for reproducing our results from this data.

Methods:

We drew from the sub-population of GitHub repositories that had at least 10 pushes, 5 issues, and 4 contributors from 2012 to 2014. We used the GitHub Archive, a collection of all public GitHub events, to generate a list of all such repositories. This gave us 48,612 candidate repositories in total. From this candidate set, we randomly sampled 5,000 repositories.

We wrote a scraper to pull all non-commit data (such as descriptions, and issue and pull request text and metadata) for a GitHub repository through the GitHub API, and used it to gather data for each repository in our sample. After scraping, we had 4,937 repositories (due to some churn in GitHub repositories).

For each language used by each repository, we manually labeled it on two independent axes: whether it is a programming language, and whether it is memory-safe.

We used two quantification models (as explained in our paper) to estimate the number of issues in each repository that were security bugs. The results of each model are in separate dataset files (`repos_data_nn.csv` and `repos_data_rfcc.csv`).
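The selection and sampling steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `(repo, actor, kind)` event tuples are a simplified, hypothetical stand-in for GitHub Archive records, the function names are invented for illustration, and counting a "contributor" as a distinct actor who pushed is an assumption, since the description does not define the term.

```python
import random
from collections import defaultdict

# Thresholds quoted in the dataset description.
MIN_PUSHES = 10
MIN_ISSUES = 5
MIN_CONTRIBUTORS = 4


def candidate_repos(events):
    """Return sorted names of repositories meeting all three thresholds.

    `events` is an iterable of (repo, actor, kind) tuples. This shape is
    hypothetical; real GitHub Archive records are richer JSON objects.
    """
    pushes = defaultdict(int)
    issues = defaultdict(int)
    contributors = defaultdict(set)
    for repo, actor, kind in events:
        if kind == "push":
            pushes[repo] += 1
            # Assumption: a contributor is any distinct actor who pushed.
            contributors[repo].add(actor)
        elif kind == "issue":
            issues[repo] += 1
    return sorted(
        repo
        for repo in contributors
        if pushes[repo] >= MIN_PUSHES
        and issues[repo] >= MIN_ISSUES
        and len(contributors[repo]) >= MIN_CONTRIBUTORS
    )


def sample_repos(candidates, k=5000, seed=0):
    """Draw a uniform random sample without replacement (at most k items)."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(candidates, min(k, len(candidates)))


# Tiny synthetic event stream: one repository passes, one does not.
events = []
for i in range(12):  # 12 pushes spread over 4 distinct actors
    events.append(("big/repo", f"dev{i % 4}", "push"))
for i in range(6):  # 6 issues
    events.append(("big/repo", "reporter", "issue"))
for i in range(3):  # too few pushes and contributors
    events.append(("small/repo", "solo", "push"))
events.append(("small/repo", "solo", "issue"))

candidates = candidate_repos(events)
picked = sample_repos(candidates, k=5000)
print(candidates, picked)
```

Applied to the real candidate set, the same filter would yield the 48,612 repositories mentioned in the description only if the counting rules matched the authors' exactly, which this sketch does not claim.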

Created:

2017-08-31
