Experiment results for the thesis titled "FasterSparql: An architecture for query mediation over loosely coupled federations of knowledge graphs"

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/13960677

Description:

fastersparql experiments

Directory structure:

- results: files generated as a result of the benchmark
  - $machine/current: JMH logs and JMH .json files of experiments that were not profiled
  - $machine/current-profiler: .jfr files of experiments re-executed under async-profiler
  - $machine.csv: CSV file aggregating all JMH .json results in $machine/current
  - $machine-profiler.csv: CSV file aggregating all JMH .json results for executions in $machine/current-profiler
  - tasks.csv: CSV where each column corresponds to a task, as defined in class Jfr2Csv in the fastersparql-microbench module
- scripts:
  - config.sh: configurations for scripts
  - download-lrb-*.sh: downloads a subset of the indices used to run experiments
  - run.sh: executes all experiments, without profiling
  - run-profiler.sh: executes, under async-profiler, all experiments where the federation members and the mediator are hosted in the same JVM
- skel: initial contents of /home/user for a new VM
  - .bashrc: default from Ubuntu 24.04 with modified $PATH
  - .tmux.conf: changes C-b to C-a
  - apps/: ready-to-use JDK, node and Comunica
  - fs-experiments: this is where init-vm.sh will place the contents of scripts, and also where experiment results will be stored, in $(date -I)-results-${run_script_name} directories
  - lrb: all datasets (HDT, Virtuoso, TDB and MMFS) are stored here

Reproducing the experiments

Step 1: Acquire a VM

These experiments were done on Ubuntu 24.04 VMs and on physical machines running an up-to-date Arch (September 2024). Ubuntu 20.04 should also work, as should any contemporary Linux distribution, including distributions without systemd. Windows and macOS are not supported: while all software used in these experiments is multiplatform, the scripts use Linux-specific commands.

Step 2: Create a non-root user with sudo permission. Some scripts will issue sudo commands, for which you can manually type the password.
Alternatively, you can allow specific commands to run without a password by adding this to the sudoers file:

    # run-profiler.sh will temporarily allow kernel profiling and visibility of
    # (hashed) kernel pointers
    %sudo ALL=(ALL) NOPASSWD: /usr/sbin/sysctl
    # hint: ./run.sh && sudo shutdown -h now will save some dollars
    %sudo ALL=(ALL) NOPASSWD: /usr/sbin/shutdown
    # sudo apt will only be invoked from init-vm.sh
    %sudo ALL=(ALL) NOPASSWD: /usr/bin/apt

Always edit the sudoers file with visudo.

Step 4: With access to a fresh VM with hostname vm and a sudoer user named user, run on the local machine:

    scripts/init-vm.sh user@vm

This will upload the skel and copy the scripts/ contents into ~/fs-experiments/.

Step 5: Download/build datasets. There are four sources for datasets:

- download-lrb-* scripts will download HDT and TDB datasets from Zenodo
- Virtuoso dumps must be downloaded from the links provided by LargeRDFBench
- LargeRDFBench-all.tdb must be built because it is too large for Zenodo

Be polite when using the download-lrb-* scripts: avoid too many concurrent downloads, and prefer downloading onto one machine you control and then rsyncing to the other machines.

To build LargeRDFBench-all.tdb:

1. Convert the LargeRDFBench-all.hdt file into a gzipped N-Triples file:

       hdt2rdf -f ntriples LargeRDFBench-all.hdt - | gzip -c > LargeRDFBench-all.nt.gz

2. Build the TDB database on a POSIX system (only Linux was tested):

       tdb2.xloader --loc LargeRDFBench-all.tdb LargeRDFBench-all.nt.gz

Building the TDB database will take 6 to 8 hours and will use 50% of physical RAM. This should be done on a server or VM instead of a workstation: although the remaining 50% is not directly used by the bulk load, that memory will be used by the Linux kernel page cache.

Step 6: Customize the run.sh and run-profiler.sh scripts, ideally by creating a copy of those scripts with a descriptive name.
Typical customizations remove or comment out some sets of experiments, such as skipping experiments with TDB or not running B queries.

The run.sh and run-profiler.sh scripts group all experiments that use the same set of files, in order to re-use the Linux kernel page cache. The warm-up iterations should be enough to ensure reasonable I/O caching. However, grouping at the launcher-script level avoids situations where TDB fails to completely warm up during the warm-up iterations.

The scripts and benchmark code strive to provide an equally optimistic assessment of all evaluated components. Thus, they try to preserve the page cache instead of explicitly dropping caches before each experiment. Dropping I/O caches does provide interesting results, but it is not realistic, since triple stores, like any database, are often long-lived processes that actively attempt to benefit from the Linux kernel page cache.

Experiments are parameterized. Therefore, each run_bench call describes a set of experiments by the combination of parameter values given as arguments to the run_bench bash function. Each experiment executes on a JVM fork, so that it is not influenced by garbage and configuration changes made by a previous experiment. The core of the experiment scripts are calls such as the following:

    (FORKS=2 run_bench "$RESULTS_DIR" "C$i" FS_JSON_EMIT,FS_JSON_IT $FLOWS $B_JSON,CA 5min 5)

Where:

- "C$i" is the query set that will be executed in this set of experiments.
  $i will assume values 1 to 10 and .*
- FS_JSON_EMIT,FS_JSON_IT is a comma-separated list of values from the SourceKind enum in fastersparql-bench. In this case, the experiments use a federation of NettySparqlServer SPARQL endpoints with the JSON results serialization. In some experiments the NettySparqlServer drains the underlying MMFS triple store (a fastersparql instance over a single MMFS directory) using an iterator, and in other experiments using an emitter.
- $FLOWS, which is defined as EMIT,ITERATE, sets which flow models should be used by the drainer that drains the queries in the query set from the federation.
- $B_JSON,CA, which expands into COMPRESSED,TERM_NI,CA, defines which batch types must be used when draining queries.
- 5min is the iteration timeout: if it is reached, the query currently being mediated is cancelled. If cancellation does not complete within 1 minute, the entire experiment is aborted (i.e., there is a deadlock/hang).
- 5 is the number of iterations per forked JVM.
- FORKS=2 defines the number of forked JVMs that will be spawned for each combination of experiment parameters. In this case, there will be two forks, which execute sequentially, leading to a total of 10 iterations, all executed sequentially.

Step 7: Run the experiments, for example:

    ./run.sh ; ./run-profiler.sh ; (sleep 30s && sudo shutdown -h now)

This will generate two new directories in ~/fs-experiments:

- $(date -I)-run: logs and results from ./run.sh
- $(date -I)-run-profiler: logs and results from ./run-profiler.sh

Step 8: Copy the result directories into results/$machine/current and results/$machine/current-profiler on your local machine (since the VM is disposable). The $(date -I)-run-profiler directory can take up 4 GiB or more, due to the .jfr files collected by async-profiler.
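
The way a single run_bench call fans out into experiments and iterations can be sketched in Python. This is only an illustration: it assumes that run_bench takes the full Cartesian product of the comma-separated parameter lists, which the description above suggests but does not spell out, and the function name expand_run_bench and the dictionary keys are hypothetical, not part of fastersparql.

```python
from itertools import product

def expand_run_bench(sources, flows, batches, forks, iterations):
    """Enumerate one experiment per (source, flow, batch) combination.

    Assumption: run_bench uses the full Cartesian product of its
    comma-separated arguments. Warm-up iterations are not counted here.
    """
    return [
        {
            "source": source,
            "flow": flow,
            "batch": batch,
            # forks run sequentially, each with `iterations` measurement iterations
            "measurement_iterations": forks * iterations,
        }
        for source, flow, batch in product(
            sources.split(","), flows.split(","), batches.split(",")
        )
    ]

# Values taken from the example call in Step 6.
FLOWS = "EMIT,ITERATE"
B_JSON = "COMPRESSED,TERM_NI"  # so that "$B_JSON,CA" expands to COMPRESSED,TERM_NI,CA

experiments = expand_run_bench(
    "FS_JSON_EMIT,FS_JSON_IT", FLOWS, f"{B_JSON},CA", forks=2, iterations=5
)
print(len(experiments))                           # 12
print(experiments[0]["measurement_iterations"])   # 10
```

Under that assumption, the example call yields 2 x 2 x 3 = 12 experiments for each query set, each with 10 measurement iterations, which is a quick way to estimate how long a customized script will run.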
Processing experiment results

To process experiment results, use scripts/process-results.sh:

    ./scripts/process-results.sh results

This will take more than 29 hours of 100% CPU usage on all cores, with consistently high CPU temperatures (~90 °C). Consider running this on a VM, and use cpupower to prevent hardware damage:

    cpupower frequency-set --max 3.5GHz

The result of this processing will be a $machine.csv file and a $machine-profiler.csv file in results/. These files describe each iteration executed during the experiments and provide its average invocation time. Note that experiments have the following containment relationships:

- run.sh issues multiple calls to:
  - run_bench, which for each combination of parameter values will execute one:
    - experiment, which consists of one or more:
      - forked JVMs, each performing 3 warm-up iterations, and n (i.e. 5 or 10) measurement:
        - iterations, where each iteration has an unbounded number of:
          - invocations, where each invocation drains all results of all queries in the query set of the experiment.

The st.csv file was generated by thesis.Rmd after processing all iterations in all $machine.csv files. st.csv contains the bootstrapped mean and its 95% confidence interval, considering all iterations in all experiments.
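
For readers unfamiliar with bootstrapped means, the following Python sketch shows one common variant, the percentile bootstrap. It is not the code in thesis.Rmd, whose exact procedure is not described here; the function name bootstrap_mean_ci, the number of resamples, and the sample values are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap of the mean (illustrative, not the thesis code).

    Returns (bootstrapped mean, lower bound, upper bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(means), lower, upper

# Hypothetical per-iteration average invocation times, in milliseconds.
sample = [101.2, 98.7, 105.4, 99.9, 102.3, 97.8, 103.1, 100.5, 104.0, 99.2]
mean, lower, upper = bootstrap_mean_ci(sample)
print(f"mean={mean:.2f} ms, 95% CI=[{lower:.2f}, {upper:.2f}] ms")
```

In this sketch the confidence interval is read directly from the 2.5th and 97.5th percentiles of the resampled means; other bootstrap variants correct for bias and skew and may give slightly different bounds.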

Created:

2025-02-01