Binder Launch Records
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/4891790
Description:
The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here, and a database version is available here. This repo includes a database dump with records from 2018-11-03 to 2021-01-21. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The database is stored in SQLite format (binder-launches.sqlite), with the following schema for the main table (binder).
CREATE TABLE [binder] (
    [timestamp] TEXT,
    [version] INTEGER,
    [provider] TEXT,
    [spec] TEXT,
    [ref] TEXT,
    [origin] TEXT,
    [repo] TEXT,
    [resolved_ref] TEXT,
    [org] TEXT
);
timestamp is the ISO timestamp of the launch.
provider gives the type of source repo being launched ("GitHub" is by far the most common). The remaining explanations assume GitHub; other providers may differ.
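spec gives a unique-ish identifier for the particular branch/release being built. For GitHub it consists of the user ID or org, the repo name, and the branch/release name, separated by slashes (org/repo/ref).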
ref, repo, and org provide the same info, but split up differently (repo includes both the user ID/org and the actual repo name). These may be removed in a future release of this dataset, and shouldn't be used.
origin indicates which backend was used. Each has its own storage, compute, etc. so this info might be important for evaluating caching and performance.
resolved_ref specifies the git commit that was actually used, rather than the reference name. Note that this info was not recorded from the beginning, so only the more recent entries include it.
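As a starting point for exploring the launch table, the following Python sketch uses the standard-library sqlite3 module to summarise launches. The function names are illustrative and not part of the dataset, and the last query assumes that a missing resolved_ref is stored as NULL or as an empty string, which the description does not specify.

```python
import sqlite3


def launches_per_provider(conn):
    """Return (provider, launch_count) rows, most frequently launched first."""
    query = """
        SELECT [provider], COUNT(*) AS launches
        FROM [binder]
        GROUP BY [provider]
        ORDER BY launches DESC, [provider]
    """
    return conn.execute(query).fetchall()


def top_specs(conn, provider="GitHub", limit=10):
    """Return the most frequently launched specs for one provider."""
    query = """
        SELECT [spec], COUNT(*) AS launches
        FROM [binder]
        WHERE [provider] = ?
        GROUP BY [spec]
        ORDER BY launches DESC, [spec]
        LIMIT ?
    """
    return conn.execute(query, (provider, limit)).fetchall()


def count_launches_with_resolved_ref(conn):
    """Count launches that recorded the resolved git commit.

    Assumption: older records without a commit hold NULL or ''.
    """
    query = """
        SELECT COUNT(*)
        FROM [binder]
        WHERE [resolved_ref] IS NOT NULL AND [resolved_ref] <> ''
    """
    return conn.execute(query).fetchone()[0]
```

Each function takes an open connection, for example one obtained with sqlite3.connect("binder-launches.sqlite"). The queries rely on spec rather than ref, repo, or org, since those three fields may be removed in a future release.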
The Binder launch dataset identifies the source repos that were used, but doesn't give any indication of their contents. We crawled GitHub to collect the specification files that were fed into repo2docker when preparing the notebook environments, along with filesystem metadata for each repo. Repos that had since been deleted or made private were skipped. The results are in binder-specs.sqlite. The schema is as follows.
CREATE TABLE specs (
    ok BOOLEAN DEFAULT FALSE,
    remote TEXT NOT NULL,
    git_ref TEXT,
    git_commit TEXT,
    apt TEXT,
    conda TEXT,
    pip TEXT,
    pipfile TEXT,
    docker TEXT,
    setup TEXT,
    julia TEXT,
    r TEXT,
    nix TEXT,
    postbuild TEXT,
    start TEXT,
    runtime TEXT,
    ls TEXT,
    resolved_commit TEXT,
    PRIMARY KEY(remote, git_ref, git_commit)
);
The ok field indicates whether the repo was cloned successfully; entries where ok is false can probably be excluded from any processing. Here remote corresponds to repo in the launch database, git_ref to ref, and git_commit to resolved_ref. Because not all launch records include the resolved git commit, it could not serve as the primary key on its own; on newer records where the launch dataset does include it, it forms part of the primary key. In either case, the commit that was actually cloned is recorded in resolved_commit.
For each repo, we collected spec files into the following fields (see the repo2docker docs for details on what these are). The database stores the verbatim file contents, with no parsing or further processing.
conda: environment.yml
pip: requirements.txt
apt: apt.txt
pipfile: Pipfile.lock or Pipfile
docker: Dockerfile
setup: setup.py
julia: Project.toml or REQUIRE
r: install.R
nix: default.nix
postbuild: postBuild
start: start
runtime: runtime.txt
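To see how often each kind of spec file appears, the column-to-file mapping above can be used to count populated fields among successfully cloned repos. This is a minimal sketch: the helper name is hypothetical, and it assumes that ok is stored as 1/0 and that an absent file is stored as NULL or an empty string, neither of which is stated in the description.

```python
import sqlite3

# Column name -> spec file(s) whose verbatim contents it holds.
SPEC_COLUMNS = {
    "conda": "environment.yml",
    "pip": "requirements.txt",
    "apt": "apt.txt",
    "pipfile": "Pipfile.lock or Pipfile",
    "docker": "Dockerfile",
    "setup": "setup.py",
    "julia": "Project.toml or REQUIRE",
    "r": "install.R",
    "nix": "default.nix",
    "postbuild": "postBuild",
    "start": "start",
    "runtime": "runtime.txt",
}


def spec_file_usage(conn):
    """Return {column: number of cloned repos with non-empty content}."""
    counts = {}
    for column in SPEC_COLUMNS:
        # Column names come from the fixed dict above, never from user
        # input, so interpolating them into the SQL text is safe here.
        query = f"""
            SELECT COUNT(*) FROM specs
            WHERE ok AND "{column}" IS NOT NULL AND "{column}" <> ''
        """
        counts[column] = conn.execute(query).fetchone()[0]
    return counts
```

Called with a connection to binder-specs.sqlite, this returns one count per spec column. Because the fields hold raw file contents, any deeper analysis (for example, extracting package names from requirements.txt) needs its own parsing step.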
The ls field gives a metadata listing of the repo contents (excluding the .git directory). The field is JSON-encoded, and the JSON type of each value indicates the kind of filesystem entry:
Object: filesystem directory. Keys are file names within it. Values are the contents, which can be regular files, symlinks, or subdirectories.
String: symlink. The string value gives the link target.
Number: regular file. The number value gives the file size in bytes.
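The type-based encoding above can be decoded by dispatching on the Python type that json.loads produces for each value. The sketch below totals the entries in one ls value; the function name and the returned dictionary keys are illustrative choices, not part of the dataset.

```python
import json


def summarize_listing(ls_json):
    """Summarise one `ls` value: counts of dirs, files, symlinks, and bytes."""
    totals = {"dirs": 0, "files": 0, "symlinks": 0, "bytes": 0}
    # An explicit stack avoids recursion limits on deeply nested repos.
    stack = [json.loads(ls_json)]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            # Object: a directory (the top-level object is the repo root).
            totals["dirs"] += 1
            stack.extend(node.values())
        elif isinstance(node, str):
            # String: a symlink; the value is the link target.
            totals["symlinks"] += 1
        elif isinstance(node, (int, float)) and not isinstance(node, bool):
            # Number: a regular file; the value is its size in bytes.
            totals["files"] += 1
            totals["bytes"] += node
        else:
            raise ValueError(f"unexpected value in listing: {node!r}")
    return totals
```

For a listing such as {"README.md": 120, "src": {"main.py": 300, "latest": "main.py"}}, this counts two directories (the root and src), two regular files totalling 420 bytes, and one symlink.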
Created:
2021-09-05



