
Replication Package for Paper "How Early Participation Determines Long-Term Sustained Activity in GitHub Projects"

Indexed by NIAID Data Ecosystem on 2026-03-13
Download link:
https://zenodo.org/record/7059029
Description:
This replication package can be used to replicate the results in the paper. It contains 1) a dataset of 290,255 repositories; and 2) Python scripts for training and interpreting models.

We recommend manually setting up the required environment on a commodity Linux machine with at least 1 CPU core, 8 GB of memory, and 100 GB of free storage space. We developed the package and ran all our experiments on an Ubuntu 20.04 server with two Intel Xeon Gold CPUs, 320 GB of memory, and 36 TB of RAID 5 storage.

We use GHTorrent to restore the historical states of 290,255 repositories with more than 57 commits, 4 PRs, 1 issue, 1 fork, and 2 stars. The raw repository data is stored in `Replication Package/data/prodata.pkl`, and the feature contributions produced by the LIME model are stored in `Replication Package/data/limeres_m2_k1.pkl`. We sort items by the order in `Replication Package/data/randind.npy`, which can be used to reproduce the same results as in the paper. `Replication Package/data/X_test_m2_k1.pkl` and `Replication Package/data/y_test_m2_k1.pkl` store the test dataset for the LIME model.

Run `Replication Package/fitdata.py` to get the results in Tables III and IV, `Replication Package/draw_compare_variable.py` to get Figure 2, and `Replication Package/allvari_statistics.py` to get Table II. `Replication Package/Variable_comparison_with_different_parameter.pdf` shows the LIME results under different parameters. `Replication Package/sample_pros.csv` lists the randomly selected repositories from Section III.B.
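Before running the scripts, it may help to confirm that the download unpacked completely. The following is a minimal sketch, not part of the package itself: it checks only for the data files and scripts named in the description above and prints which script corresponds to which table or figure. The helper name `missing_files` and the assumption that the archive unpacks into a `Replication Package` folder in the current directory are illustrative.

```python
from pathlib import Path

# Data files named in the description, relative to the package root.
DATA_FILES = [
    "data/prodata.pkl",
    "data/limeres_m2_k1.pkl",
    "data/randind.npy",
    "data/X_test_m2_k1.pkl",
    "data/y_test_m2_k1.pkl",
]

# Scripts named in the description and the result each one reproduces.
SCRIPT_OUTPUTS = {
    "fitdata.py": "Tables III and IV",
    "draw_compare_variable.py": "Figure 2",
    "allvari_statistics.py": "Table II",
}


def missing_files(package_root):
    """Return the expected relative paths that are absent under package_root."""
    root = Path(package_root)
    expected = DATA_FILES + list(SCRIPT_OUTPUTS)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumed unpack location; adjust to wherever the archive was extracted.
    root = Path("Replication Package")
    missing = missing_files(root)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        for script, output in SCRIPT_OUTPUTS.items():
            print(f"{script} -> {output}")
```

The check is deliberately limited to file presence. It does not open the `.pkl` or `.npy` files, since their internal structure and the libraries needed to load them are not stated in the description.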
Created:
2022-09-08