Genomic datasets used for evaluation of k-mer representations and indexes
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13997397
Description:
This record contains genomic datasets; for some of them it also provides subsampled $k$-mer sets (files whose names contain `_subsampled_`). The datasets are:
Two E. coli pan-genomes, each obtained as the union of E. coli genomes from the 661k collection. One contains all genomes (without quality filtering); for the other (HQ) we applied high-quality filtering.
S. pneumoniae pan-genome: 616 genomes, as provided in RASE DB S. pneumoniae https://github.com/c2-d2/rase-db-spneumoniae-sparc/
SARS-CoV-2 pan-genome, downloaded from GISAID https://gisaid.org/ (access upon registration) on Jan 25, 2023 (GISAID version 2023/01/23, 14,682,066 genomes, 430 Gbp).
Metagenomic sample SRS063932 (Illumina raw reads) of the human microbiome with accession SRX023459, downloaded from https://www.hmpdacc.org/hmp/HMASM/. The FASTQ files were converted to FASTA files using `seqtk seq -A -C`.
Human RNA-seq Illumina raw reads with accession SRX348811, downloaded using the prefetch tool from the SRA toolkit and then converted to FASTA format with `fastq-dump --split-3 --fasta`.
Human genome Illumina raw reads with accession SRX016231, downloaded using the prefetch tool from the SRA toolkit and then converted to FASTA format with `fastq-dump --split-3 --fasta`.
Human genome assembly chm13v2.0 (T2T), downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz.
Two MiniKraken datasets (4GB and 8GB), downloaded from https://ccb.jhu.edu/software/kraken/, with the 31-mers dumped using Jellyfish 1.1.12.
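For the read datasets listed above, the FASTQ-to-FASTA step essentially discards the quality lines and keeps the read name and sequence. The sketch below illustrates that idea in Python; it is not the code of `seqtk` or `fastq-dump`. The function name `fastq_to_fasta` and the demo reads are hypothetical, and the sketch assumes strict four-line FASTQ records.

```python
import io


def fastq_to_fasta(fastq, fasta):
    """Copy records from a FASTQ text stream to a FASTA text stream.

    Assumption: every record spans exactly four lines (header, sequence,
    '+', qualities). Header comments and quality strings are dropped.
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq.readline()
        if header == "":
            break  # end of input
        if not header.strip():
            continue  # tolerate stray blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected FASTQ header, got: {header!r}")
        sequence = fastq.readline().strip()
        fastq.readline()  # '+' separator line, ignored
        fastq.readline()  # quality string, ignored
        fields = header[1:].split()
        name = fields[0] if fields else ""  # keep the read name, drop the comment
        fasta.write(f">{name}\n{sequence}\n")
        count += 1
    return count


# Tiny in-memory demo with two made-up reads.
demo_in = io.StringIO("@read1 some comment\nACGT\n+\nIIII\n@read2\nGGA\n+\nIII\n")
demo_out = io.StringIO()
n_records = fastq_to_fasta(demo_in, demo_out)
```

For real data the original command-line tools remain the reference; this sketch only shows what information survives the conversion.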
The resulting FASTA files (apart from the human genome assembly chm13v2.0 and the MiniKraken datasets) were converted to unitigs with GGCAT v1.1.0 using `ggcat build -k {kmer-size} -m 200 -j 5 -s {min-freq} -o {preprocessed_unitigs} {input_FASTA}`, where we used $k=128$ and `{min-freq}`=1 for the pan-genomes, and $k=32$ and `{min-freq}`=2 for the datasets from raw reads.
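To make the two parameter settings explicit, the following sketch fills in the command template quoted above for either kind of dataset. The helper name `ggcat_build_command`, its `from_raw_reads` argument, and the file names in the demo are hypothetical; the flags and values are copied from the description, and the sketch only assembles the argument list without running GGCAT.

```python
def ggcat_build_command(input_fasta, output_unitigs, from_raw_reads):
    """Return the `ggcat build` argument list described in this record.

    Pan-genomes use k=128 with min-freq 1; datasets from raw reads use
    k=32 with min-freq 2 (values taken from the description above).
    """
    kmer_size, min_freq = (32, 2) if from_raw_reads else (128, 1)
    return [
        "ggcat", "build",
        "-k", str(kmer_size),
        "-m", "200",
        "-j", "5",
        "-s", str(min_freq),
        "-o", output_unitigs,
        input_fasta,
    ]


# Hypothetical file names, for illustration only.
pangenome_cmd = ggcat_build_command("pangenome.fa", "pangenome.unitigs.fa", from_raw_reads=False)
reads_cmd = ggcat_build_command("reads.fa", "reads.unitigs.fa", from_raw_reads=True)
```

Such a list could be passed to `subprocess.run(cmd, check=True)` on a machine where GGCAT v1.1.0 is installed.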
Finally, the subsampled files `{dataset}_subsampled_k{$k$}_r0.1.fa.xz` contain a random 10% of the distinct canonical $k$-mers from the full $k$-mer set of the given dataset. Each FASTA file stores one subsampled $k$-mer per sequence.
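The subsampling can be illustrated with a short Python sketch. It rests on assumptions the record does not state: the canonical form is taken to be the lexicographically smaller of a $k$-mer and its reverse complement, windows containing non-ACGT characters are skipped, and the sample is a fixed-size draw of `round(rate * n)` $k$-mers. All function names are hypothetical, and this is not the authors' actual pipeline.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Assumed convention: the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def distinct_canonical_kmers(sequences, k):
    """Collect the set of distinct canonical k-mers over all sequences."""
    kmers = set()
    for sequence in sequences:
        sequence = sequence.upper()
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N etc.
                kmers.add(canonical(kmer))
    return kmers


def subsample(kmers, rate, seed=0):
    """Draw round(rate * n) k-mers uniformly at random, without replacement."""
    rng = random.Random(seed)
    ordered = sorted(kmers)  # sort first so a fixed seed gives a fixed sample
    return rng.sample(ordered, round(len(ordered) * rate))


def to_fasta(kmers):
    """Format the k-mers as FASTA text, one k-mer per sequence."""
    return "".join(f">{index}\n{kmer}\n" for index, kmer in enumerate(kmers))
```

For example, `to_fasta(subsample(distinct_canonical_kmers(sequences, 32), 0.1))` would produce text in the one-$k$-mer-per-record layout described above, which could then be compressed with `xz`.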
Created:
2025-01-23



