
Turtle

DataCite Commons | Updated 2020-09-05 | Indexed 2024-07-25
Download link:
https://figshare.com/articles/dataset/Turtle/791582/3
Description:
We present a novel method that balances time, space, and accuracy requirements to efficiently extract frequent k-mers, even for high-coverage libraries and large genomes such as human. Our method is designed to minimize cache misses: it uses a pattern-blocked Bloom filter to remove infrequent k-mers from consideration, combined with a novel sort-and-compact scheme, rather than a hash table, for the actual counting. While this increases theoretical complexity, the savings in cache misses reduce the empirical running times. A variant can resort to a counting Bloom filter for even larger memory savings, at the expense of false negatives in addition to the false positives common to all Bloom filter based approaches. A comparison to the state of the art shows reduced memory requirements and running times.
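
The two-stage idea in the description (a Bloom filter screens out k-mers seen only once, then surviving k-mers are counted by sorting and compacting rather than hashing) can be illustrated with a minimal Python sketch. This is not the Turtle implementation: it assumes a plain Bloom filter in place of the pattern-blocked one, ignores reverse complements, and the names `BloomFilter`, `add_and_check`, and `frequent_kmers` are hypothetical.

```python
import hashlib
from itertools import groupby


class BloomFilter:
    """Plain Bloom filter (assumption: stands in for the pattern-blocked variant)."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add_and_check(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def frequent_kmers(reads, k, min_count=2):
    """Return {kmer: count} for k-mers seen at least min_count times (approximate)."""
    bloom = BloomFilter()
    candidates = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The first sighting only sets bits in the filter; repeat
            # sightings are appended to the candidate array.
            if bloom.add_and_check(kmer):
                candidates.append(kmer)

    # Sort-and-compact: equal k-mers become adjacent, then each run is
    # collapsed into a single (kmer, count) entry.
    candidates.sort()
    counts = {}
    for kmer, run in groupby(candidates):
        total = sum(1 for _ in run) + 1  # +1 for the sighting absorbed by the filter
        if total >= min_count:
            counts[kmer] = total
    return counts
```

For example, `frequent_kmers(["ACGTACGT", "ACGTTT"], k=4)` should report `ACGT` with a count of 3, barring a Bloom filter false positive, which would inflate a count by one. The sorted array is scanned sequentially, which is the kind of access pattern that tends to be friendlier to caches than random hash-table probes.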

Provider:
figshare
Created:
2016-01-18