Replication Package for Paper "How Early Participation Determines Long-Term Sustained Activity in GitHub Projects"
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/7059029
Description:
This replication package can be used to replicate the results in the paper. It contains (1) a dataset of 290,255 repositories and (2) Python scripts for training and interpreting models.
We recommend manually setting up the required environment on a commodity Linux machine with at least 1 CPU core, 8 GB of memory, and 100 GB of free storage space. We developed and ran all our experiments on an Ubuntu 20.04 server with two Intel Xeon Gold CPUs, 320 GB of memory, and 36 TB of RAID 5 storage.
We use GHTorrent to restore the historical states of 290,255 repositories with more than 57 commits, 4 PRs, 1 issue, 1 fork, and 2 stars. The raw repository data is stored in `Replication Package/data/prodata.pkl`, and the feature contributions produced by the LIME model are stored in `Replication Package/data/limeres_m2_k1.pkl`. We sort items in the order given by `Replication Package/data/randind.npy`; using this order reproduces the same results as in the paper.
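As a rough illustration of how the stored order might be applied, the sketch below loads the raw data and rearranges it by the saved index. It is not taken from the package's own scripts. It assumes that `prodata.pkl` unpickles to a list, NumPy array, or pandas object, and that `randind.npy` holds zero-based row positions; neither is confirmed by the description. The helper names `load_pickle`, `reorder`, and `load_in_saved_order` are hypothetical.

```python
import pickle
from pathlib import Path

import numpy as np


def load_pickle(path):
    """Unpickle one file. Only unpickle files from a source you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def reorder(items, order):
    """Return `items` rearranged by the integer positions in `order`.

    Assumption: `order` contains zero-based row positions.
    Handles pandas objects (via .iloc), NumPy arrays, and plain sequences.
    """
    order = np.asarray(order, dtype=int)
    if hasattr(items, "iloc"):          # pandas DataFrame or Series
        return items.iloc[order]
    if isinstance(items, np.ndarray):   # NumPy fancy indexing
        return items[order]
    return [items[i] for i in order]    # list, tuple, and similar


def load_in_saved_order(data_dir):
    """Load prodata.pkl and sort it by randind.npy (hypothetical helper)."""
    data_dir = Path(data_dir)
    items = load_pickle(data_dir / "prodata.pkl")
    order = np.load(data_dir / "randind.npy")
    return reorder(items, order)
```

Calling `load_in_saved_order("Replication Package/data")` would then return the items in the saved order, provided the assumptions above hold for the actual files.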
`Replication Package/data/X_test_m2_k1.pkl` and `Replication Package/data/y_test_m2_k1.pkl` store the test dataset for the LIME model. Run `Replication Package/fitdata.py` to obtain the results in Tables III and IV, `Replication Package/draw_compare_variable.py` to obtain Figure 2, and `Replication Package/allvari_statistics.py` to obtain Table II. `Replication Package/Variable_comparison_with_different_parameter.pdf` shows the LIME results under different parameters. `Replication Package/sample_pros.csv` lists the randomly selected repositories from Section III.B.
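Before running the scripts, it may help to confirm that the files named above are in place. The following is a minimal sketch, not part of the package: the script-to-output mapping is copied from the description, while the helper names `missing_files` and `build_commands` are hypothetical. It also assumes that the scripts take no command-line arguments and are run from the package root, which the description does not state.

```python
import sys
from pathlib import Path

# Script-to-output mapping, as stated in the package description.
SCRIPTS = {
    "allvari_statistics.py": "Table II",
    "fitdata.py": "Tables III and IV",
    "draw_compare_variable.py": "Figure 2",
}

# Data files named in the package description.
DATA_FILES = [
    "data/prodata.pkl",
    "data/limeres_m2_k1.pkl",
    "data/randind.npy",
    "data/X_test_m2_k1.pkl",
    "data/y_test_m2_k1.pkl",
]


def missing_files(root):
    """Return the expected scripts and data files not found under `root`."""
    root = Path(root)
    expected = list(SCRIPTS) + DATA_FILES
    return [name for name in expected if not (root / name).is_file()]


def build_commands(root):
    """Build one interpreter command per script (assumes no arguments)."""
    root = Path(root)
    return [[sys.executable, str(root / script)] for script in SCRIPTS]
```

If `missing_files("Replication Package")` returns an empty list, each command from `build_commands` could be passed to `subprocess.run(cmd, cwd="Replication Package", check=True)`.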
Created:
2022-09-08



