Source Code Archiving to the Rescue of Reproducible Deployment — Replication Package
Download link:
https://zenodo.org/record/11243113
Description:
Replication package for the paper:
Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. ACM REP '24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622
Generating the paper
The paper can be generated using the following command:
    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make
This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.
It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):
GNU Make
SQLite 3
GNU AWK
Rubber
Graphviz
TeXLive
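Before attempting a build without Guix, it can help to check that the tools listed above are on the PATH. The following is a minimal Python sketch and is not part of the replication package. The executable names (make, sqlite3, gawk, rubber, dot, pdflatex) are assumptions about how these packages are typically installed, and the check ignores versions, so passing it does not guarantee the same results as the Guix environment.

```python
import shutil

# Assumed executable names for the tools listed above; adjust for your system.
REQUIRED_TOOLS = {
    "GNU Make": "make",
    "SQLite 3": "sqlite3",
    "GNU AWK": "gawk",
    "Rubber": "rubber",
    "Graphviz": "dot",
    "TeXLive": "pdflatex",
}


def find_missing(tools=REQUIRED_TOOLS):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All listed tools were found.")
```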
Structure
data/ contains the data examined in the paper
scripts/ contains dedicated code for the paper
logs/ contains logs generated during certain computations
Preservation of Guix
Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. The monitoring has been carried out by Timothy Sample, who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.
The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.
Analysis
Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.
The pog-types.sql query gives the counts of each source type (e.g. “git” or “tar-gz”) for each commit covered by the database.
The pog-status.sql query gives the archival status of the sources by commit. For each commit, it counts how many sources are stored in the Software Heritage archive, how many are missing from it, and how many have an unknown status. The pog-status-total.sql query does the same thing over all sources, without grouping them by commit.
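As a rough illustration of the kind of aggregation these two status queries perform, here is a Python sketch against a toy in-memory table. The table source_status and its columns commit_id, source and status are hypothetical and chosen only for this example; the real schema is in data/schema.sql and the real queries are the .sql files named above.

```python
import sqlite3

# Hypothetical toy schema, for illustration only (see data/schema.sql for the real one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_status (commit_id TEXT, source TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO source_status VALUES (?, ?, ?)",
    [
        ("c1", "hello", "stored"),
        ("c1", "foo", "missing"),
        ("c2", "hello", "stored"),
        ("c2", "foo", "missing"),
        ("c2", "bar", "unknown"),
        ("c2", "baz", "stored"),
    ],
)


def status_by_commit(conn):
    """Count sources per (commit, status) pair."""
    query = """
        SELECT commit_id, status, COUNT(*)
        FROM source_status
        GROUP BY commit_id, status
        ORDER BY commit_id, status
    """
    return conn.execute(query).fetchall()


def status_total(conn):
    """Count each distinct source once, regardless of commit."""
    query = """
        SELECT status, COUNT(DISTINCT source)
        FROM source_status
        GROUP BY status
        ORDER BY status
    """
    return dict(conn.execute(query).fetchall())
```

In the toy data a source that appears in several commits is counted once per commit by status_by_commit but only once overall by status_total, which is the distinction between the per-commit and total views.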
The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.
Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
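A proportion of this kind can be computed in a single SQL expression. The Python sketch below shows one way to do it on a toy table; the table sources, its columns name and swhid, and the placeholder identifier are all hypothetical, and the actual swhid-ratio.sql query may be written differently.

```python
import sqlite3

# Hypothetical toy table: swhid is NULL when no SWHID is recorded for a source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sources (name TEXT, swhid TEXT)")
placeholder = "swh:1:dir:" + "0" * 40  # placeholder value, not a real SWHID
conn.executemany(
    "INSERT INTO sources VALUES (?, ?)",
    [
        ("hello", placeholder),
        ("foo", placeholder),
        ("bar", None),
        ("baz", placeholder),
    ],
)


def swhid_ratio(conn):
    """Return the fraction of rows whose swhid column is not NULL."""
    # COUNT(swhid) skips NULLs while COUNT(*) counts every row;
    # multiplying by 1.0 forces floating-point division in SQLite.
    (ratio,) = conn.execute(
        "SELECT COUNT(swhid) * 1.0 / COUNT(*) FROM sources"
    ).fetchone()
    return ratio
```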
Estimating missing sources
The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix’s Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.
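The comparison described above can be sketched in Python as follows. This is not the package's implementation, which relies on the scripts in scripts/ and the Makefile. The sketch assumes the hashes of interest are SHA-256 digests, which are 52 characters long in Nix-style base-32, and the function names are made up for the example.

```python
import re

# Nix-style base-32 uses this 32-character alphabet (no e, o, t, u).
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"

# Assumption: we look for SHA-256 digests, i.e. exactly 52 characters,
# not embedded in a longer run of lowercase letters or digits.
HASH_RE = re.compile(
    r"(?<![0-9a-z])[%s]{52}(?![0-9a-z])" % NIX_BASE32_ALPHABET
)


def find_hashes(text):
    """Return the set of candidate Nix-style base-32 hashes found in text."""
    return set(HASH_RE.findall(text))


def missed_hashes(history_text, known_hashes):
    """Return hashes present in the history text but absent from known_hashes."""
    return find_hashes(history_text) - set(known_hashes)


# Toy usage with synthetic hashes (not real package hashes).
h1 = "0" * 52
h2 = "1" * 51 + "z"
history = f'(base32 "{h1}")\n(base32 "{h2}")\nnot a hash: {"e" * 52}\n'
print(missed_hashes(history, {h1}))
```

Here history stands in for text extracted from Git history, and the set passed as known_hashes stands in for the hashes recorded in the PoG database.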
A naïve search of Git history results in an overestimate because of Guix’s branch development model: it finds hashes that were never exposed to users of ‘guix pull’. To work around this, we also approximate the history of commits available to ‘guix pull’ by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete; the missing history is reconstructed in the data/missing-links.txt file.
This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.
To generate the estimate, use:
    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make data/missing-sources.txt
If not using Guix, you will need additional software beyond what is used to generate the paper:
GNU Guile
GNU Bash
GNU Mailutils
GNU Parallel
Measuring link rot
To measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the specific moment when they ran, so it is not possible to reproduce the exact same outputs. However, the trends they show over time should be very similar. To run them, you need an installation of Guix. For instance:
guix repl -q scripts/table-per-origin.scm
When running these scripts for the paper, we tracked their output and saved it inside the logs directory.
Created: 2024-05-23



