
Supplementary data for CRISPR spacer-protospacer matching benchmarks

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/15171877
Description:
Raw Outputs for CRISPR Spacer Matching Benchmark Study

This dataset contains the raw outputs from sequence alignment tools used in benchmarking protospacer identification. The data is organized into two main categories: simulated data (random sequences) and real data (IMG/VRv4). These are separated into two tar.zst files for ease of download (see their trees below). For more information, or for the exact commands used to run the tools, please see the "tool_configs" folder in the GitLab repository.

Simulated Run Directory Naming Convention:

`run_t_{threads}_nc_{n_contigs}_ns_{n_spacers}_ir_{min_insertions}_{max_insertions}_lm_{min_mismatches}_{max_mismatches}_prc_{prop_rc}`

Where:
- t: Number of threads used
- nc: Number of contigs generated
- ns: Number of spacers generated
- ir: Insertion range (min and max insertions per spacer; a range used to simulate the total number of occurrences in the reference (contig) file)
- lm: Length/mismatch range (min and max mismatches allowed)
- prc: Proportion of reverse complement insertions

Compressed Formats:
- FASTA files (.fa.gz): Compressed with bgzip. To decompress, you can use `gunzip -c file.fa.gz > file.fa`
- SAM files (.sam.gz): Compressed with bgzip. To decompress, you can use `gunzip -c file.sam.gz > file.sam` (note that not all tools tested conformed to SAM v1.4 or produced the extended CIGAR string).
- TSV files (.tsv.zst): Compressed with zstd. To decompress, use `zstd -d file.tsv.zst` (note that blastn and mmseqs results used a slightly modified "m6" output format: "qaccver", "saccver", "nident", "length", "mismatch", "qlen", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore").

General results/performance files:
- `tools_results.tsv.zst`: Raw alignment results from each tool.
- `hyperfine_results.tsv.zst`: Per-tool runtime as captured by hyperfine.
- `performance_results.tsv.zst`: Precision, recall, and F1 scores. Note: these values might differ slightly from the ones in the manuscript or the Jupyter notebooks, because they use a modified, more permissive definition of true positives. That definition is not used in the manuscript or the notebooks; it serves only as a quick and dirty proxy.
- `tool_performance_stats_mismatches_*.tsv`: Performance breakdown by mismatch level (note that the level could be == mismatch or mismatch >= value, i.e. up to mismatch to exactly n mismatch).
- `tool_performance_by_mismatches.json`: Like the previous item, but usually after some aggregation (into one file, in a narrow table format).

Tools tested for all data
- Bowtie1 v1.3.1 (64-bit, gcc 13.3.0)
- Bowtie2 v2.5.4 (64-bit, gcc 13.3.0)
- BBTools (bbmap-skimmer) v39.13
- StrobeAlign v0.15.0
- BLASTN v2.16.0 (build Dec 14 2024 23:05:40)
- MMseqs2 db8ad2d14d0a285ce0ad62bbefd0dce927663315
- MUMMER v4.0.1
- minimap2 2.28-r1209
- spacer-containment v0.1.0
- LexicMap v0.5.0 (06741c8)

Tools tested for simulated data only
- BWA 0.7.19-r1273
- HISAT2 v2.2.1 (64-bit)

Directory Structure

Note - once decompressed, the structure of the different simulation runs is the same, so the tree below includes the subdirectory tree for only one such run.
Note2 - the simulated data contains a "combined_sims" folder. This is an aggregation of the individual runs, and is the main data used in the Performance_simulated_combined.ipynb Jupyter/Python notebook in the GitLab repo.

Simulated Data:

simulated
├── Runs
│   ├── combined_sims
│   │   ├── simulated_data
│   │   │   ├── ground_truth.tsv.zst
│   │   │   ├── simulated_contigs.fa.gz
│   │   │   └── simulated_spacers.fa.gz
│   │   └── tools_results.tsv.zst
│   └── sims
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_0_0_prc_0.5
│       │   ├── bash_scripts
│       │   │   ├── bbmap_skimmer.sh
│       │   │   ├── bbmapskimmermod.sh
│       │   │   ├── blastn.sh
│       │   │   ├── bowtie1.sh
│       │   │   ├── bowtie2.sh
│       │   │   ├── bwa_mem.sh
│       │   │   ├── hisat2.sh
│       │   │   ├── lexicmap.sh
│       │   │   ├── minimap2.sh
│       │   │   ├── minimap2_mod.sh
│       │   │   ├── minimap2_og.sh
│       │   │   ├── mmseqs.sh
│       │   │   ├── mmseqs2_map.sh
│       │   │   ├── mummer4.sh
│       │   │   ├── spacer_containment.sh
│       │   │   └── strobealign.sh
│       │   ├── hyperfine_results.tsv
│       │   ├── performance_results.tsv
│       │   ├── raw_outputs
│       │   │   ├── bbmap_skimmer.sh.json
│       │   │   ├── bbmap_skimmer_mod_output.sam.gz
│       │   │   ├── bbmap_skimmer_output.sam.gz
│       │   │   ├── bbmapskimmermod.sh.json
│       │   │   ├── blastn.sh.json
│       │   │   ├── blastn_output.tsv.zst
│       │   │   ├── bowtie1.sh.json
│       │   │   ├── bowtie1_output.sam.gz
│       │   │   ├── bowtie2.sh.json
│       │   │   ├── bowtie2_output.sam.gz
│       │   │   ├── bwa_mem.sh.json
│       │   │   ├── bwa_mem_output.sam.gz
│       │   │   ├── hisat2.sh.json
│       │   │   ├── hisat2_output.sam.gz
│       │   │   ├── hyperfine_output_bbmap_skimmer.sh.txt
│       │   │   ├── hyperfine_output_bbmapskimmermod.sh.txt
│       │   │   ├── hyperfine_output_blastn.sh.txt
│       │   │   ├── hyperfine_output_bowtie1.sh.txt
│       │   │   ├── hyperfine_output_bowtie2.sh.txt
│       │   │   ├── hyperfine_output_bwa_mem.sh.txt
│       │   │   ├── hyperfine_output_hisat2.sh.txt
│       │   │   ├── hyperfine_output_lexicmap.sh.txt
│       │   │   ├── hyperfine_output_minimap2.sh.txt
│       │   │   ├── hyperfine_output_minimap2_mod.sh.txt
│       │   │   ├── hyperfine_output_minimap2_og.sh.txt
│       │   │   ├── hyperfine_output_mmseqs.sh.txt
│       │   │   ├── hyperfine_output_mmseqs2_map.sh.txt
│       │   │   ├── hyperfine_output_mummer4.sh.txt
│       │   │   ├── hyperfine_output_spacer_containment.sh.txt
│       │   │   ├── hyperfine_output_strobealign.sh.txt
│       │   │   ├── lexicmap.sh.json
│       │   │   ├── lexicmap_output.tsv.zst
│       │   │   ├── minimap2.sh.json
│       │   │   ├── minimap2_mod.sh.json
│       │   │   ├── minimap2_mod_output.sam.gz
│       │   │   ├── minimap2_og.sh.json
│       │   │   ├── minimap2_og_output.sam.gz
│       │   │   ├── minimap2_output.sam.gz
│       │   │   ├── mmseqs.sh.json
│       │   │   ├── mmseqs2_map.sh.json
│       │   │   ├── mmseqs_output.tsv.zst
│       │   │   ├── mmseqsmap_output.tsv.zst
│       │   │   ├── mummer4.sh.json
│       │   │   ├── mummer4_output.sam.gz
│       │   │   ├── spacer_containment.sh.json
│       │   │   ├── spacer_containment_output.tsv.zst
│       │   │   ├── strobealign.sh.json
│       │   │   └── strobealign_output.sam.gz
│       │   ├── simulated_data
│       │   │   ├── ground_truth.tsv.zst
│       │   │   ├── simulated_contigs.fa.gz
│       │   │   └── simulated_spacers.fa.gz
│       │   └── tools_results.tsv.zst
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_1_1_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_2_2_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_3_3_prc_0.5
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_0_0_prc_0.5
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_1_1_prc_0.5
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_2_2_prc_0.5
│       └── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_3_3_prc_0.5
├── plots
│   ├── matrix_0.html
│   ├── matrix_0.svg
│   ├── matrix_1.html
│   ├── matrix_1.svg
│   ├── matrix_2.html
│   ├── matrix_2.svg
│   ├── matrix_3.html
│   ├── matrix_3.svg
│   ├── matrix_4.html
│   ├── matrix_4.svg
│   ├── matrix_5.html
│   ├── matrix_5.svg
│   ├── tool_performance_by_mismatches.html
│   ├── tool_performance_by_mismatches.json
│   ├── tool_performance_grid.html
│   ├── tool_performance_grid.svg
│   ├── tool_performance_mismatches_0.pdf
│   ├── tool_performance_mismatches_1.pdf
│   ├── tool_performance_mismatches_2.pdf
│   ├── tool_performance_mismatches_3.pdf
│   ├── tool_performance_stats_mismatches_0.tsv
│   ├── tool_performance_stats_mismatches_1.tsv
│   ├── tool_performance_stats_mismatches_2.tsv
│   ├── tool_performance_stats_mismatches_3.tsv
│   ├── tool_performance_vs_mismatches.html
│   ├── tool_performance_vs_mismatches.svg
│   ├── tool_recall_per_spacer_contig_grid_log.pdf
│   └── tool_recall_per_spacer_only_grid_log.pdf
└── results
    ├── aggregated_ground_truth.parquet
    ├── aggregated_performance_runtime.parquet
    ├── aggregated_runtimes.parquet
    ├── aggregated_tool_results.parquet
    ├── matrix_0.tsv
    ├── matrix_1.tsv
    ├── matrix_2.tsv
    ├── matrix_3.tsv
    ├── matrix_4.tsv
    ├── matrix_5.tsv
    └── tool_performance_by_mismatches.tsv

Real Data:

real_data
├── bash_scripts
│   ├── bbmap_skimmer.sh
│   ├── blastn.sh
│   ├── bowtie1.sh
│   ├── bowtie2.sh
│   ├── lexicmap.sh
│   ├── minimap2.sh
│   ├── mmseqs.sh
│   ├── mummer4.sh
│   ├── spacer_containment.sh
│   └── strobealign.sh
├── job_scripts
│   ├── bbmap_skimmer.sh
│   ├── blastn.sh
│   ├── bowtie1.sh
│   ├── bowtie2.sh
│   ├── lexicmap.sh
│   ├── minimap2.sh
│   ├── mmseqs.sh
│   ├── mummer4.sh
│   ├── spacer_containment.sh
│   └── strobealign.sh
├── plots
│   ├── matrix_0.html
│   ├── matrix_0.svg
│   ├── matrix_1.html
│   ├── matrix_1.svg
│   ├── matrix_2.html
│   ├── matrix_2.svg
│   ├── matrix_3.html
│   ├── matrix_3.svg
│   ├── tool_performance_detailed_3bins.pdf
│   ├── tool_performance_detailed_stats.tsv
│   ├── tool_performance_max_mm_0_detailed_3bins.pdf
│   ├── tool_performance_max_mm_0_detailed_stats.tsv
│   ├── tool_performance_max_mm_1_detailed_3bins.pdf
│   ├── tool_performance_max_mm_1_detailed_stats.tsv
│   ├── tool_performance_max_mm_2_detailed_3bins.pdf
│   ├── tool_performance_max_mm_2_detailed_stats.tsv
│   ├── tool_performance_max_mm_3_detailed_3bins.pdf
│   ├── tool_performance_max_mm_3_detailed_stats.tsv
│   ├── tool_performance_mm_0_detailed_stats.tsv
│   ├── tool_performance_mm_1_detailed_stats.tsv
│   ├── tool_performance_mm_2_detailed_stats.tsv
│   ├── tool_performance_mm_3_detailed_stats.tsv
│   ├── tool_performance_panel.pdf
│   ├── tool_performance_panel.svg
│   ├── tool_performance_perfect_detailed_3bins.pdf
│   ├── tool_performance_perfect_detailed_stats.tsv
│   ├── tool_performance_vs_mismatches.pdf
│   ├── tool_performance_vs_occurrences_detailed.pdf
│   ├── tool_performance_vs_occurrences_detailed_3bins.pdf
│   ├── upset_0.pdf
│   ├── upset_1.pdf
│   ├── upset_2.pdf
│   └── upset_3.pdf
├── raw_outputs
│   ├── bbmap_skimmer_output.sam.gz
│   ├── blastn_output.tsv.zst
│   ├── bowtie1_output.sam.gz
│   ├── bowtie2_output.sam.gz
│   ├── lexicmap_output.tsv.zst
│   ├── minimap2_output.sam.gz
│   ├── mmseqs_output.tsv.zst
│   ├── mummer4_output.sam.gz
│   ├── spacer_containment_output.tsv.zst
│   └── strobealign_output.sam.gz
├── results
│   ├── Tool_exclusivity.tsv
│   ├── deviation_counts.csv
│   ├── matrix_0.tsv
│   ├── matrix_1.tsv
│   ├── matrix_2.tsv
│   ├── matrix_3.tsv
│   ├── matrix_4.tsv
│   ├── matrix_5.tsv
│   ├── spacer_counts_with_tools.parquet
│   ├── summary_stats.parquet
│   ├── tool_performance_by_mismatches.tsv
│   ├── tool_performance_vs_occurrences_detailed_stats.tsv
│   └── tools_results_mm_recalced.parquet
├── sacct.out
└── slurm_logs
    ├── bbmap_skimmer-15192666.err
    ├── bbmap_skimmer-15192666.out
    ├── bowtie1-15296994.err
    ├── bowtie1-15296994.out
    ├── bowtie2-15192707.err
    ├── bowtie2-15192707.out
    ├── lexicmap-15192729.err
    ├── lexicmap-15192729.out
    ├── minimap2-15192728.err
    ├── minimap2-15192728.out
    ├── mmseqs-15192703.err
    ├── mmseqs-15192703.out
    ├── mummer4-15192721.err
    ├── mummer4-15192721.out
    ├── spacer_containment-15224132.err
    ├── spacer_containment-15224132.out
    ├── strobealign-15192702.err
    ├── strobealign-15192702.out
    ├── vsearch-15258853.err
    └── vsearch-15258853.out
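
The simulated run directory names listed above encode their simulation parameters, following the naming convention given earlier in the description. As a minimal sketch (not part of the dataset or its repository), the Python snippet below splits such a name into its fields. The names `RunParams`, `RUN_NAME_RE`, and `parse_run_name` are hypothetical helpers introduced here for illustration, and the regular expression assumes every field is a non-negative number, as in the directory names shown in the tree.

```python
import re
from typing import NamedTuple


class RunParams(NamedTuple):
    """Fields encoded in a simulated run directory name (hypothetical helper)."""
    threads: int
    n_contigs: int
    n_spacers: int
    min_insertions: int
    max_insertions: int
    min_mismatches: int
    max_mismatches: int
    prop_rc: float


# Mirrors the documented pattern:
# run_t_{threads}_nc_{n_contigs}_ns_{n_spacers}_ir_{min}_{max}_lm_{min}_{max}_prc_{prop_rc}
RUN_NAME_RE = re.compile(
    r"^run_t_(?P<threads>\d+)"
    r"_nc_(?P<n_contigs>\d+)"
    r"_ns_(?P<n_spacers>\d+)"
    r"_ir_(?P<min_insertions>\d+)_(?P<max_insertions>\d+)"
    r"_lm_(?P<min_mismatches>\d+)_(?P<max_mismatches>\d+)"
    r"_prc_(?P<prop_rc>\d+(?:\.\d+)?)$"
)


def parse_run_name(name: str) -> RunParams:
    """Parse a run directory name; a trailing slash is tolerated."""
    match = RUN_NAME_RE.match(name.rstrip("/"))
    if match is None:
        raise ValueError(f"not a recognised run directory name: {name!r}")
    fields = match.groupdict()
    # prop_rc is the only fractional field; all others are integers.
    prop_rc = float(fields.pop("prop_rc"))
    return RunParams(**{key: int(value) for key, value in fields.items()},
                     prop_rc=prop_rc)


if __name__ == "__main__":
    params = parse_run_name("run_t_25_nc_40000_ns_1030_ir_1_2205_lm_0_0_prc_0.5")
    print(params)
```

Parsing the names this way makes it straightforward to group the runs under `sims` by mismatch range or spacer count before loading their `tools_results.tsv.zst` files.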
Created:
2025-04-09