
Memory-bound k-mer selection for large and evolutionary diverse reference libraries

Source: DataONE. Updated: 2025-03-10. Indexed: 2025-04-26.
Download link:
https://search.dataone.org/view/sha256:61f907f83929eab655b8804bb25c4ec54abba2f83210ce2d05c355968e0be0ec
Description:
Using k-mers to find sequence matches is increasingly common in bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which, luckily, are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. Reference k-mers are kept in memory at query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of k-mers present in an ultra-large dataset to include in a library such that the classification of reads suffers the least. Our experi...

# Data from: Memory-bound k-mer selection for large and evolutionary diverse reference libraries

Data belonging to the following papers:

* Şapcı, A. O. B., & Mirarab, S. (2024). *Memory-bound k-mer selection for large and evolutionary diverse reference libraries*. Genome Research.
* Şapcı, A. O. B., & Mirarab, S. (2024). Memory-bound and taxonomy-aware k-mer selection for Ultra-large reference libraries. In J. Ma (Ed.), *Research in Computational Molecular Biology* (pp. 340–343). Springer Nature Switzerland. [https://doi.org/10.1007/978-1-0716-3989-4_26](https://doi.org/10.1007/978-1-0716-3989-4_26)

See [https://ter-trees.ucsd.edu/data/krank/](https://ter-trees.ucsd.edu/data/krank/) for a catalog of libraries and the query reads that we simulated for benchmarking. Descriptions of libraries and a tutorial can be found in the main GitHub repository.

## KRANK libraries

* `wol_v1-lib_rand_free-k29_w35_h13_b16_s8.tar.gz`: WoL-v1 library lightweight (6.25GB) & with random - fast-mode
* ...
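
The description names minimizers as one of the existing subsampling strategies. For readers unfamiliar with the idea, here is a minimal Python sketch of textbook window minimizers. It is not KRANK's selection method and not code from the papers. The function name `minimizers`, the lexicographic ordering, and the meaning of `w` (number of consecutive k-mers per window) are assumptions made for illustration, and the parameter values are not tied to the `k29_w35` fields in the library file name.

```python
def minimizers(seq: str, k: int, w: int) -> set:
    """Return (position, k-mer) pairs picked as window minimizers.

    Illustrative assumption: each window of w consecutive k-mers keeps its
    lexicographically smallest k-mer, taking the leftmost one on ties.
    """
    # Enumerate every k-mer of the sequence in order of position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window of w k-mers and keep the smallest k-mer in each window.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return selected


if __name__ == "__main__":
    # Toy input: 6 k-mers in total, of which a subset is retained.
    for pos, kmer in sorted(minimizers("ACGTACGT", k=3, w=2)):
        print(pos, kmer)
```

Because adjacent windows often share their smallest k-mer, the retained set is smaller than the full k-mer set, which is the memory saving such subsampling aims for.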

Created:
2025-03-13