Replication data and online supplement for: Underproduction: An Approach for Measuring Risk in Open Source Software
Source: DataCite Commons. Updated 2025-05-12; indexed 2025-05-17.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/PUCD2P
Resource description:
These materials were produced as part of:
<p>
Champion, Kaylea and Benjamin Mako Hill. (2021) "Underproduction: An approach for measuring risk in open source software." 28th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). Preprint: https://arxiv.org/abs/2103.00352. DOI: 10.1109/SANER50967.2021.00043
<p>
In this archive, you'll find:
<ul>
<li> <b>inst_all_packages_full_results.tab</b> Summary data on all packages as they appear in the paper. This is the place to look if you want to examine the underproduction factor associated with each package.
<li> <b>
inst_all_packages_full_results-DESCRIPTION.txt </b> A description of the fields in the inst_all_packages_full_results.tab file.
<li> <b>R_Code.tar.gz</b> Containing R code to reproduce figures and tables from fitted Bayesian hierarchical survival models:
<ul>
<li> dfPrep.R, used to create datasets_for_modeling.RData
<li> models.R, a resource for model information
<li> model_visualization.R, the core code for presenting fitted models and relationships
<li> standalone_dsp.R, descriptive statistics
<li> standalone_bayes.R, to produce tables for the paper
<li> lib-00-utils.R, some utility functions
</ul>
<li><b> datasets_for_modeling.RData</b>, the core dataset used for this analysis
<li><b>Stan.tar.gz</b>, a directory of Stan model output; on our supercomputing node these models took multiple days to run and converge
<li><b>Figures.tar.gz</b>, a directory of figures from the paper
<li><b>Raw_Data_Parsers.tar.gz</b>, a directory of both the raw data and the parsers used to obtain it. The directory contains a HowTo file if you would like to reproduce the scraping/cloning part of the project. Note, however, that the original analysis included an rsync copy of the Debian bug database; if you conduct an analysis from scratch, the data you obtain will have changed since our rsync.
<li><b>Appendix.tar.gz</b>, containing figures and data associated with our appendix, which uses an alternate measure of importance ("vote", which represents recent usage but omits packages where usage does not update atime; the paper used "inst")
<ul>
<li>appendix_with_vote.R, the code
<li>appendix_figures, a directory of figures similar to those in the paper but produced for the appendix
<li>vote_all_packages_full_results.csv -- summary data on all packages
<li>vote_all_packages_full_results.csv.DESCRIPTION, a description of the fields in the vote_all_packages_full_results.csv file.
</ul>
</ul>
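<p>
As a rough illustration of how the summary table might be examined, the following Python sketch reads a tab-separated file with a header row and ranks rows by a numeric column. It is not part of the archive. The column names <b>package</b> and <b>underproduction_factor</b> and the three sample rows are hypothetical stand-ins made up for this example; the real field names are listed in inst_all_packages_full_results-DESCRIPTION.txt.

```python
import csv
import io


def load_packages(handle, delimiter="\t"):
    """Read a delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def rank_by(rows, column, top=10):
    """Return the `top` rows with the largest numeric value in `column`."""
    numeric = []
    for row in rows:
        try:
            numeric.append((float(row[column]), row))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
    numeric.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in numeric[:top]]


# Hypothetical stand-in data: these column names and values are invented
# for illustration and are NOT taken from the archive.
SAMPLE = (
    "package\tunderproduction_factor\n"
    "alpha\t0.5\n"
    "beta\t2.0\n"
    "gamma\tNA\n"
)

rows = load_packages(io.StringIO(SAMPLE))
top_rows = rank_by(rows, "underproduction_factor", top=2)
for row in top_rows:
    print(row["package"], row["underproduction_factor"])
```

To try this on the real file, one would replace the in-memory sample with a handle such as open("inst_all_packages_full_results.tab", newline="", encoding="utf-8") and pass a column name taken from the DESCRIPTION file. How to interpret the ranked values is covered in the paper, not by this sketch.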
<p>
For more information, please contact: <br>
Kaylea Champion (she/her)<br>
<i>kaylea@uw.edu</i> | <i>khascall@gmail.com</i><br>
@kayleachampion
<p>
<i>Abstract:</i><br>
The widespread adoption of Free/Libre and Open Source Software (FLOSS) means that the ongoing maintenance of many widely used software components relies on the collaborative effort of volunteers who set their own priorities and choose their own tasks. We argue that this has created a new form of risk that we call 'underproduction' which occurs when the supply of software engineering labor becomes out of alignment with the demand of people who rely on the software produced. We present a conceptual framework for identifying relative underproduction in software as well as a statistical method for applying our framework to a comprehensive dataset from the Debian GNU/Linux distribution that includes 21,902 source packages and the full history of 461,656 bugs. We draw on this application to present two experiments: (1) a demonstration of how our technique can be used to identify at-risk software packages in a large FLOSS repository and (2) a validation of these results using an alternate indicator of package risk. Our analysis demonstrates both the utility of our approach and reveals the existence of widespread underproduction in a range of widely-installed software components in Debian.
Provider:
Harvard Dataverse
Created:
2021-01-12



