Geographic Diversity in Public Code Contributions — Replication Package

Mendeley Data, updated 2024-06-25, indexed 2024-06-27

Download link:

https://zenodo.org/records/6390355

Description:

This document describes how to replicate the findings of the paper:

Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471

This document comes with the software needed to mine and analyze the data presented in the paper.

## Prerequisites

These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs.

It is advisable to create a Python virtual environment and install the following PyPI packages:

    click==8.0.4
    cycler==0.11.0
    fonttools==4.31.2
    kiwisolver==1.4.0
    matplotlib==3.5.1
    numpy==1.22.3
    packaging==21.3
    pandas==1.4.1
    patsy==0.5.2
    Pillow==9.0.1
    pyparsing==3.0.7
    python-dateutil==2.8.2
    pytz==2022.1
    scipy==1.8.0
    six==1.16.0
    statsmodels==0.13.2

## Initial data

- swh-replica: a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.

  We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package, due to both their volume and the personal information they contain, such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article:

  Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030

  Once retrieved, the data can be loaded into PostgreSQL to populate swh-replica.

- names.tab: forenames and surnames per country, with their frequency
- zones.acc.tab: countries/territories, timezones, population and world zones
- c_c.tab: matches between ccTLD entities and world zones

## Data preparation

Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst:

    sh> ./export.sh

Run the authors cleanup script to create authors--clean.csv.zst:

    sh> ./cleanup.sh authors.csv.zst

Filter out implausible names and create authors--plausible.csv.zst:

    sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

## Zone detection by email

Run the email detection script to create author-country-by-email.tab.zst:

    sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst

## Database creation and initial data ingestion

Create the PostgreSQL DB:

    sh> createdb zones-commit

Note that from now on, a command prefixed with the psql> prompt is meant to be executed in psql connected to the zones-commit database.

Import data into the PostgreSQL DB:

    sh> ./import_data.sh

## Zone detection by name

Extract commits data from the DB and create commits.tab, which is used as input for the zone detection script:

    sh> psql -f extract_commits.sql zones-commit

Run the world zone detection script to create commit_zones.tab.zst:

    sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst

Use ./assign_world_zone.py --help if you are interested in changing the script parameters.

Ingest the zone assignment data into the DB:

    psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''

## Extraction and graphs

Run the script that executes the queries extracting the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you wish to modify the extraction parameters (start/end year, sampling, …).

    sh> ./extract_data.sh

Run the script that creates the graphs from the previously extracted .grid files:

    sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf
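The ingestion step pipes commit_zones.tab.zst through `cut -f1,6 | grep -Ev '\s$'` before loading it into the commit_zone table. The Python sketch below is not part of the replication package. It only illustrates what that filter appears to do: keep the first and sixth tab-separated columns and drop rows where the projected line ends in whitespace, which happens when the sixth column is empty. The function name `select_commit_zone_rows`, the sample rows and the zone label are hypothetical, and the reading of the sixth column as the assigned zone is an assumption. Unlike `cut`, the sketch simply skips rows with fewer than six columns.

```python
import re
from typing import Iterable, Iterator

# Same test as `grep -E '\s$'`: the line ends with a whitespace character.
TRAILING_WHITESPACE = re.compile(r"\s$")


def select_commit_zone_rows(lines: Iterable[str]) -> Iterator[str]:
    """Approximate `cut -f1,6 | grep -Ev '\\s$'` on tab-separated lines."""
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) < 6:
            # Assumption: well-formed rows have at least six columns;
            # shorter rows are skipped here for simplicity.
            continue
        # Equivalent of `cut -f1,6`: keep columns 1 and 6 only.
        projected = f"{fields[0]}\t{fields[5]}"
        # Equivalent of `grep -Ev '\s$'`: drop rows ending in whitespace,
        # e.g. rows whose sixth column is empty (the line ends with a tab).
        if TRAILING_WHITESPACE.search(projected):
            continue
        yield projected


if __name__ == "__main__":
    # Hypothetical sample rows; the real column layout may differ.
    sample = [
        "c1\ta\tb\tc\td\tEurope\n",  # sixth column filled in
        "c2\ta\tb\tc\td\t\n",        # sixth column empty
    ]
    for row in select_commit_zone_rows(sample):
        print(row)
```

Only the first sample row survives the filter, so rows without a value in the sixth column never reach the commit_zone table.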

Created: 2023-06-28
