Data from: Memory-bound k-mer selection for large and evolutionary diverse reference libraries
Source: DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-06-15
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.0000000c2
Description:
Matching sequences by their k-mers is increasingly common in
bioinformatic applications, including metagenomic sequence classification.
The accuracy of these downstream applications relies on the density of
the reference databases, which, luckily, are rapidly growing. While the
increased density provides hope for dramatic improvements in accuracy,
scalability is a concern. Reference k-mers are kept in memory at
query time, and saving all k-mers of these ever-expanding databases is
fast becoming impractical. Several strategies for subsampling have been
proposed, including minimizers and finding taxon-specific k-mers. However,
we contend that these strategies are inadequate, especially when reference
sets are taxonomically imbalanced, as are most microbial libraries. In
this paper, we explore approaches for selecting a fixed-size subset of
k-mers present in an ultra-large dataset to include in a library such that
the classification of reads suffers the least. Our experiments demonstrate
the limitations of existing approaches, especially for novel and poorly
sampled groups. We propose a library construction algorithm called KRANK
(K-mer RANKer) that combines several components, including a hierarchical
selection strategy with adaptive size restrictions and an equitable
coverage strategy. We implement KRANK in highly optimized code and combine
it with the locality-sensitive-hashing classifier CONSULT-II to build a
taxonomic classification and profiling method. On several benchmarks,
KRANK k-mer selection dramatically reduces memory consumption with minimal
loss in classification accuracy. We show in extensive analyses based on
CAMI benchmarks that KRANK outperforms k-mer-based alternatives in terms
of taxonomic profiling and comes close to the best marker-based methods in
terms of accuracy.
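
As a rough illustration of the minimizer subsampling that the description names as an existing strategy, the sketch below keeps only the smallest k-mer from each window of w consecutive k-mers. It is not KRANK's selection algorithm and is not taken from the dataset. The function names, the plain lexicographic ordering, and the toy sequence are assumptions made for illustration; practical tools typically order k-mers by a hash and use canonical k-mers.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Each window holds w consecutive k-mers; the lexicographically smallest
    one is kept (leftmost on ties). Lexicographic order is an illustrative
    assumption, not what any particular tool uses.
    """
    ks = kmers(seq, k)
    selected = set()
    for start in range(len(ks) - w + 1):
        window = ks[start:start + w]
        # min over offsets returns the leftmost smallest k-mer in the window
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


if __name__ == "__main__":
    seq = "ACGTACGTGA"  # toy sequence, hypothetical
    all_kmers = kmers(seq, 3)
    kept = minimizers(seq, 3, 4)
    print(f"{len(kept)} of {len(all_kmers)} k-mer positions kept: {kept}")
```

Adjacent windows usually share their minimizer, so far fewer k-mers are stored than the sequence contains. The selection is driven purely by local windows, however, so it has no notion of a fixed memory budget or of how evenly taxa are covered, which is the gap the description says KRANK targets.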
Provider:
Dryad
Created:
2024-09-10



