Memory-bound k-mer selection for large and evolutionary diverse reference libraries
Source: DataONE. Updated: 2025-03-10. Indexed: 2025-04-26.
Download link:
https://search.dataone.org/view/sha256:61f907f83929eab655b8804bb25c4ec54abba2f83210ce2d05c355968e0be0ec
Description:
Matching sequences by their k-mers is increasingly common in bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which, luckily, are growing rapidly. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. Reference k-mers are kept in memory at query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of the k-mers present in an ultra-large dataset to include in a library such that the classification of reads suffers the least. Our experi...

# Data from: Memory-bound k-mer selection for large and evolutionary diverse reference libraries
Data belonging to the following paper:
* Şapcı, A. O. B., & Mirarab, S. (2024). *Memory-bound k-mer selection for large and evolutionary diverse reference libraries*. Genome Research.
* Şapcı, A. O. B., & Mirarab, S. (2024). Memory-bound and taxonomy-aware k-mer selection for Ultra-large reference libraries. In J. Ma (Ed.), *Research in Computational Molecular Biology* (pp. 340–343). Springer Nature Switzerland. [https://doi.org/10.1007/978-1-0716-3989-4_26](https://doi.org/10.1007/978-1-0716-3989-4_26)
See [https://ter-trees.ucsd.edu/data/krank/](https://ter-trees.ucsd.edu/data/krank/) for a catalog of libraries and the query reads that we simulated for benchmarking. Descriptions of the libraries and a tutorial can be found in the main GitHub repository.
## KRANK libraries
* `wol_v1-lib_rand_free-k29_w35_h13_b16_s8.tar.gz`: WoL-v1 library, lightweight (6.25 GB), with random; fast mode
* `...`
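
The archive name above appears to pack build settings into letter-number tokens (`k29`, `w35`, `h13`, `b16`, `s8`). The following is a minimal sketch for pulling those tokens out of a file name, assuming the parameter tokens follow the last hyphen and are separated by underscores. The function name `parse_library_name` is hypothetical, and the code deliberately does not interpret what each letter means, since this page does not define them; the tutorial in the GitHub repository is the place to check.

```python
import re


def parse_library_name(filename: str) -> dict:
    """Split an archive name into a prefix and its letter/number tokens.

    Assumption (not documented on this page): the tokens after the last
    hyphen are build parameters written as <letters><digits>.
    """
    stem = filename
    for suffix in (".tar.gz", ".tgz"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break

    # Everything before the last hyphen is treated as the library prefix.
    prefix, _, param_part = stem.rpartition("-")

    params = {}
    for token in param_part.split("_"):
        match = re.fullmatch(r"([a-z]+)(\d+)", token)
        if match:
            params[match.group(1)] = int(match.group(2))
    return {"prefix": prefix, "params": params}


info = parse_library_name("wol_v1-lib_rand_free-k29_w35_h13_b16_s8.tar.gz")
print(info["prefix"])
print(info["params"])
```

Keeping the parsed values in a plain dictionary makes it easy to group or filter the catalog's archives by a given token without hard-coding any meaning for it.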
Created: 2025-03-13



