Turtle

Name: Turtle
Creator: figshare
Published: 2020-09-05 01:29:10
License: No description available

DataCite Commons. Updated: 2020-09-05. Indexed: 2024-07-25.

Download link:

https://figshare.com/articles/dataset/Turtle/791582/1

Description:

We present a novel method that balances time, space, and accuracy requirements to efficiently extract frequent k-mers, even for high-coverage libraries and large genomes such as human. Our method is designed to minimize cache misses: it uses a pattern-blocked Bloom filter to remove infrequent k-mers from consideration, combined with a novel sort-and-compact scheme, rather than a hash table, for the actual counting. Although this increases theoretical complexity, the savings in cache misses reduce empirical running times. A variant can resort to a counting Bloom filter for even larger memory savings, at the expense of false negatives in addition to the false positives common to all Bloom-filter-based approaches. A comparison with the state of the art shows reduced memory requirements and running times.
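The two-stage idea in the description (filter out k-mers seen only once, then count the survivors by sorting and compacting) can be illustrated with a minimal sketch. This is not the Turtle implementation: it substitutes a plain Bloom filter for the pattern-blocked one, ignores reverse complements, and makes no attempt at cache efficiency. The names `BloomFilter`, `add_and_check`, and `count_frequent_kmers`, and the default sizes, are assumptions chosen for illustration.

```python
import hashlib
from itertools import groupby


class BloomFilter:
    """Plain Bloom filter (an illustrative stand-in, not pattern-blocked)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_and_check(self, item):
        """Insert item; return True if it already appeared to be present."""
        present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_frequent_kmers(reads, k, num_bits=1 << 20, num_hashes=3):
    """Count k-mers that occur at least twice (subject to false positives)."""
    bloom = BloomFilter(num_bits, num_hashes)
    candidates = []
    for read in reads:
        for kmer in kmers(read, k):
            # The first sighting only sets bits; later sightings are kept.
            if bloom.add_and_check(kmer):
                candidates.append(kmer)
    # Sort-and-compact: equal k-mers become adjacent runs after sorting.
    candidates.sort()
    # A run of length n means n + 1 occurrences, since the first sighting
    # went into the filter only.
    return {kmer: sum(1 for _ in run) + 1 for kmer, run in groupby(candidates)}


if __name__ == "__main__":
    counts = count_frequent_kmers(["ACGTACGT", "ACGTTT"], k=4)
    print(counts["ACGT"])  # 3
```

A Bloom filter false positive in this sketch would cause a k-mer seen only once to be reported with a count of 2, which is the kind of false positive the description attributes to all Bloom-filter-based approaches.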

Provider:

figshare

Created:

2016-01-18
