Replication Package for Paper "How Early Participation Determines Long-Term Sustained Activity in GitHub Projects"

Source: NIAID Data Ecosystem (indexed 2026-03-13)

Download link:

https://zenodo.org/record/7059029

Description:

This replication package can be used to replicate the results in the paper. It contains 1) a dataset of 290,255 repositories; and 2) Python scripts for training and interpreting models.

We recommend manually setting up the required environment on a commodity Linux machine with at least 1 CPU core, 8GB memory, and 100GB of free storage space. We conducted development and executed all our experiments on an Ubuntu 20.04 server with two Intel Xeon Gold CPUs, 320GB memory, and 36TB of RAID 5 storage.

We use GHTorrent to restore the historical states of 290,255 repositories with more than 57 commits, 4 PRs, 1 issue, 1 fork and 2 stars. The raw repository data are stored in `Replication Package/data/prodata.pkl`, and the feature contributions produced by the LIME model are stored in `Replication Package/data/limeres_m2_k1.pkl`. We sort items by the order in `Replication Package/data/randind.npy`, which can be used to reproduce the same results as in the paper. `Replication Package/data/X_test_m2_k1.pkl` and `Replication Package/data/y_test_m2_k1.pkl` store the test dataset for the LIME model.

Run `Replication Package/fitdata.py` to get the results in Tables III and IV, `Replication Package/draw_compare_variable.py` to get Figure 2, and `Replication Package/allvari_statistics.py` to get Table II. `Replication Package/Variable_comparison_with_different_parameter.pdf` shows the LIME results under different parameters. `Replication Package/sample_pros.csv` lists the randomly selected repositories from Section III.B.
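Before running any of the scripts, it may help to confirm that the archive was unpacked completely. The following is a minimal sketch, not part of the package itself: the helper `missing_files` and the constant `EXPECTED_FILES` are hypothetical names, and the root directory `Replication Package` is assumed to sit in the current working directory. The relative paths are exactly those named in the description.

```python
from pathlib import Path

# Relative paths exactly as named in the package description.
EXPECTED_FILES = [
    "data/prodata.pkl",
    "data/limeres_m2_k1.pkl",
    "data/randind.npy",
    "data/X_test_m2_k1.pkl",
    "data/y_test_m2_k1.pkl",
    "fitdata.py",
    "draw_compare_variable.py",
    "allvari_statistics.py",
    "Variable_comparison_with_different_parameter.pdf",
    "sample_pros.csv",
]


def missing_files(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumption: the archive was unpacked into ./Replication Package
    missing = missing_files("Replication Package")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All files named in the description are present.")
```

If the archive was unpacked elsewhere, pass that directory to `missing_files` instead.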

Created:

2022-09-08
