pileup-hi analysis and inputs: part 2
Source: Zenodo. Updated 2026-04-16; indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.19613933
Description:
Part 2 of the pileup-hi analysis and inputs dataset: see below for instructions and file descriptions.
## Prelude
- pileup-hi version 0.9.2 was used for all analysis. You can download it using Cargo:
```bash
cargo install pileup-hi --version 0.9.2
```
You can also grab the pre-compiled binaries from the [original repo's release page](https://github.com/epiliper/pileup-hi/releases/tag/0.9.2).
- All testing was performed on macOS Sequoia 15.7.4.
- This analysis requires 1.5TB+ of disk space.
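Before starting, it may be worth confirming that the target volume has enough free space. The snippet below is a minimal sketch, assuming that "1.5TB" means 1.5 × 10^12 bytes; `has_enough_space` and `REQUIRED_BYTES` are hypothetical names and are not part of the dataset's scripts.

```python
import shutil

# Assumption: "1.5TB" is read as 1.5 * 10**12 bytes (decimal terabytes).
REQUIRED_BYTES = int(1.5 * 10**12)


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required` bytes free."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    print("enough free space:", has_enough_space("."))
```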
## Other software used
- samtools 1.2.3
- htslib 1.2.3
- perbase 1.2.0
- b3sum 1.8.3
- xsra 0.2.27
- minimap2 v2.30-r1290-dirty
- python 3.14.2
- R and Rstudio (along with packages specified in .Rmd files)
**NOTE:** for the instructions below, it is assumed that you have all the software listed somewhere in `$PATH`. For information on how to move software to `$PATH`, see [this thread](https://unix.stackexchange.com/questions/183295/adding-programs-to-path).
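One quick way to verify the `$PATH` assumption is to look up each executable before running anything. This is a minimal sketch: the executable names in `TOOLS` are assumptions inferred from the software list (for example, that pileup-hi installs a binary named `pileup-hi`), and `missing_tools` is a hypothetical helper.

```python
import shutil

# Executable names are assumptions inferred from the software list;
# adjust them if your installation uses different names.
TOOLS = ["pileup-hi", "samtools", "perbase", "b3sum", "xsra", "minimap2", "python3"]


def missing_tools(tools=TOOLS):
    """Return the subset of `tools` that shutil.which cannot find on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("all tools found")
```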
## Overall description - generating data
Analysis consisted of running different pileup programs on five datasets. This was done with three Python scripts that can be adjusted to run a selection of tools on a selection of datasets. By default, they are configured to generate data for the entire paper.
These scripts are described in detail below:
### bench.py: run time and peak memory usage
This script launches tools on specified input files and records performance information to a spreadsheet in `./reports/`.
The script is configured by default to run all conditions on all files in triplicate, but you can modify this by changing the following variables:
Change the number of iterations:
```python
NUM_ITERATIONS = 3
```
Change software, output mode, and thread count:
```python
## tuple of command, output mode, threadcount (where applicable)
METHODS = [
    # ## Pileup Mode
    (run_mpileup, "plp", 1),

    (run_pileuphi, "plp", 1), (run_perbase, "plp", 1), (run_parampileup, "plp", 1),
    (run_pileuphi, "plp", 4), (run_perbase, "plp", 4), (run_parampileup, "plp", 4),
    (run_pileuphi, "plp", 8), (run_perbase, "plp", 8), (run_parampileup, "plp", 8),
    (run_pileuphi, "plp", 12), (run_perbase, "plp", 12), (run_parampileup, "plp", 12),

    ## Nucleotide frequency mode
    (run_pileuphi, "histo", 1),
    (run_pileuphi, "histo", 4),
    (run_pileuphi, "histo", 8),
    (run_pileuphi, "histo", 12),
]
```
Change the files to run on:
```python
FILES = [
    "DRR793869_hg38.bam",
    "SRR19895870.bam",
    "SRR36374445_hg38.bam",
    "SRR30646149_hg38.bam",
    "ERR2756169_merged.bam",
]
```
Once you've adjusted this to your liking, run the following to gather benchmarking data:
```bash
python3 bench.py
```
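For readers who want to see how run time and peak memory can be captured per child process, here is a minimal sketch. It is not taken from `bench.py`: `run_and_measure` and `bench` are hypothetical names, and the approach (`os.wait4`, Unix only, with `os.waitstatus_to_exitcode` from Python 3.9+) is one possible technique rather than the one the script necessarily uses.

```python
import os
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run `cmd` and return (exit_code, wall_seconds, peak_rss) for that one child."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    # os.wait4 returns the resource usage of this specific child (Unix only).
    _, status, usage = os.wait4(proc.pid, 0)
    wall = time.perf_counter() - start
    # Record the exit code on the Popen object, since we reaped the child ourselves.
    proc.returncode = os.waitstatus_to_exitcode(status)
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return proc.returncode, wall, usage.ru_maxrss


def bench(commands, iterations=3):
    """Measure every (label, argv) pair `iterations` times; return one row per run."""
    rows = []
    for label, cmd in commands:
        for i in range(iterations):
            code, wall, rss = run_and_measure(cmd)
            rows.append({"label": label, "iteration": i, "exit_code": code,
                         "wall_s": wall, "peak_rss": rss})
    return rows


if __name__ == "__main__":
    # Placeholder command; substitute a real tool invocation.
    for row in bench([("noop", [sys.executable, "-c", "pass"])], iterations=3):
        print(row)
```

Using `os.wait4` rather than `resource.getrusage(RUSAGE_CHILDREN)` keeps the peak-memory figure specific to each run, because the latter reports a maximum over every child reaped so far.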
### compare_output.py: output file hash calculation
This script can be adjusted in the same way as `bench.py` (see above), except that `METHODS` differs slightly in structure:
```python
# tuple of label, run func, output mode, and threads
METHODS = [
    ("mpileup", run_mpileup, "plp", 1, ""),

    ("pileup-hi", run_pileuphi, "plp", 1),

    ("parallel mpileup", run_parampileup, "plp", 4), ("pileup-hi", run_pileuphi, "plp", 4),

    ("parallel mpileup", run_parampileup, "plp", 8), ("pileup-hi", run_pileuphi, "plp", 8),

    ("parallel mpileup", run_parampileup, "plp", 12), ("pileup-hi", run_pileuphi, "plp", 12),
]
```
To run this script:
```bash
python3 compare_output.py
```
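The idea of comparing outputs by hash can be illustrated as follows. This is a sketch under stated assumptions, not the logic of `compare_output.py`: the software list names `b3sum` (BLAKE3), but the Python standard library has no BLAKE3, so `hashlib.blake2b` is swapped in here. `file_digest` and `outputs_match` are hypothetical helpers.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Stream `path` through BLAKE2b (a stand-in for BLAKE3) and return the hex digest."""
    hasher = hashlib.blake2b()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so large pileup files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def outputs_match(path_a, path_b):
    """Return True when both files have identical digests."""
    return file_digest(path_a) == file_digest(path_b)


if __name__ == "__main__":
    # Demo on two throwaway files with identical placeholder content.
    with tempfile.TemporaryDirectory() as tmp:
        a, b = os.path.join(tmp, "a.out"), os.path.join(tmp, "b.out")
        for path in (a, b):
            with open(path, "wb") as fh:
                fh.write(b"placeholder pileup line\n")
        print("match:", outputs_match(a, b))
```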
### compare_size.py: compare output size differences between `pileup-hi`'s 'histo' and 'plp' output modes.
See the previous two sections for which parameters to adjust. This script writes its results to a spreadsheet whose path matches `./size_comp*`.
To run:
```bash
python3 compare_size.py
```
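As a rough illustration of the kind of comparison involved, the sketch below computes the size of two output files and their ratio. It is an assumption about how such a comparison could look, not the contents of `compare_size.py`; `size_ratio` is a hypothetical helper and the file names in the demo are placeholders.

```python
import os
import tempfile


def size_ratio(plp_path, histo_path):
    """Return (plp_bytes, histo_bytes, histo_bytes / plp_bytes)."""
    plp = os.path.getsize(plp_path)
    histo = os.path.getsize(histo_path)
    # Guard against an empty 'plp' file to avoid division by zero.
    ratio = histo / plp if plp else float("nan")
    return plp, histo, ratio


if __name__ == "__main__":
    # Demo on two throwaway files of known size.
    with tempfile.TemporaryDirectory() as tmp:
        plp_file = os.path.join(tmp, "example.plp")
        histo_file = os.path.join(tmp, "example.histo")
        with open(plp_file, "wb") as fh:
            fh.write(b"x" * 100)
        with open(histo_file, "wb") as fh:
            fh.write(b"x" * 25)
        print(size_ratio(plp_file, histo_file))
```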
## Other files
- `get_metrics.sh`: calculate alignment metrics such as depth, coverage, etc.
- `hashes_2026Feb27.csv`: output file hashes comparing pileup-hi and samtools mpileup
- `size_comp_2026Mar31.csv`: output file sizes comparing pileup-hi 'plp' and 'histo' output modes
- `bench_report_2026Mar30.csv`: benchmark data
- `para_mpileup.sh`: the parallel shell wrapper around samtools mpileup used in benchmarking. See the beginning of the script for instructions on usage.
- `bench.Rmd`: figure generation script
- `aln.sh`: script to generate BAMs from downloaded FASTQ data
- `dl.sh`: script to download FASTQs from the SRA
## Contact
Reach out to either:
- epil02 #(at)# uw #(dot)# edu
- agrening #(at)# uw #(dot)# edu
Provider: Zenodo
Created: 2026-04-16



