Memory-bound k-mer selection for large and evolutionary diverse reference libraries

DataONE | Updated 2025-03-10 | Indexed 2025-04-26

Download link:

https://search.dataone.org/view/sha256:61f907f83929eab655b8804bb25c4ec54abba2f83210ce2d05c355968e0be0ec


Description:

Using k-mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these down-stream applications relies on the density of the reference databases, which, luckily, are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. Reference k-mers are kept in the memory during the query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of k-mers present in an ultra-large dataset to include in a library such that the classification of reads suffers the least. Our experi...

# Data from: Memory-bound k-mer selection for large and evolutionary diverse reference libraries

Data belonging to the following paper:

* Şapcı, A. O. B., & Mirarab, S. (2024). *Memory-bound k-mer selection for large and evolutionary diverse reference libraries*. Genome Research.
* Şapcı, A. O. B., & Mirarab, S. (2024). Memory-bound and taxonomy-aware k-mer selection for Ultra-large reference libraries. In J. Ma (Ed.), *Research in Computational Molecular Biology* (pp. 340–343). Springer Nature Switzerland. [https://doi.org/10.1007/978-1-0716-3989-4_26](https://doi.org/10.1007/978-1-0716-3989-4_26)

See [https://ter-trees.ucsd.edu/data/krank/](https://ter-trees.ucsd.edu/data/krank/) for a catalog of libraries, and query reads that we had simulated for benchmarking. Descriptions of libraries and a tutorial can be found in the main GitHub repository.

## KRANK libraries

* `wol_v1-lib_rand_free-k29_w35_h13_b16_s8.tar.gz`: WoL-v1 library lightweight (6.25GB) & with random - fast-mode
* ...
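For readers unfamiliar with the minimizer subsampling mentioned in the description, the following is a minimal sketch of the general idea, not the method used to build the libraries above. The function names `kmers` and `minimizers` are hypothetical, and lexicographic ordering is an assumption made for readability; practical tools commonly order k-mers by a hash and work with canonical k-mers.

```python
def kmers(seq, k):
    """Return all k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Keep one k-mer per window of w bases: the lexicographically smallest.

    Illustrative only: lexicographic order is an assumed, simplified ordering.
    """
    selected = set()
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        selected.add(min(kmers(window, k)))
    return selected


if __name__ == "__main__":
    seq = "ACGTAC"
    print(sorted(set(kmers(seq, 2))))    # all distinct 2-mers
    print(sorted(minimizers(seq, 2, 4))) # the subsampled set
```

Because adjacent windows often share their smallest k-mer, the selected set is usually much smaller than the full k-mer set, which is what makes minimizers attractive for reducing memory. Note that the reduction depends on the sequence and cannot be fixed to a target size in advance.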


Created:

2025-03-13
