
CONSULT: accurate contamination removal using locality-sensitive hashing

DataONE | Updated 2024-03-29 | Indexed 2024-06-08
Download link:
https://search.dataone.org/view/sha256:448af7b098fa3f6e7dd63c919bc7c95e9adc83c99fa611f8024714447dc09da0
Resource description:
A fundamental question arises in many bioinformatics applications: does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads requires contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome requires distinguishing organelle reads from nuclear reads. While k-mer-based methods have shown promise in read matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that tests whether k-mers from a query fall within a user-specified distance of the reference dataset using locality-sensitive hashing. Taking advant...

# Access to the data used for CONSULT benchmarking

Data belonging to the following paper:

* Rachtman, E., Bafna, V., & Mirarab, S. (2021). CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genomics and Bioinformatics. [doi:10.1093/nargab/lqab071](https://doi.org/10.1093/nargab/lqab071)

## Description of the data and file structure

## Drosophila data

Genome and genome skims used for real Drosophila data analysis are provided.

### Before clean-up

#### `Dros_fastq_af_bbmerge.tar`

This file contains deduplicated reads for Drosophila species before clean-up. It contains the following Drosophila species in fq format:

* `sub_Drosophila_ananassae_2.fq.gz`: Drosophila ananassae
* `sub_Drosophila_biarmipes_2.fq.gz`: Drosophila biarmipes
* `sub_Drosophila_bipectinata_2.fq.gz`: Drosophila bipectinata
* `sub_Drosophila_erecta_2.fq.gz`: Drosophila erecta
* `sub_Drosophila_eugracilis_2.fq.gz`: Drosophila eugracilis
* `sub_Drosophila_mauritiana...
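
As a quick sanity check on an archive laid out as described above, the following minimal Python sketch counts the reads in each `.fq.gz` member of a tar file. It is not part of the CONSULT tooling. The function names `count_fastq_reads` and `reads_per_sample` are hypothetical, and the sketch assumes that the members are gzip-compressed FASTQ files with the standard four lines per record; the exact paths inside the archive are not stated in the description, so members are matched by suffix only.

```python
import gzip
import io
import tarfile


def count_fastq_reads(fileobj):
    """Count records in a gzip-compressed FASTQ stream.

    Assumption: every record occupies exactly four lines.
    """
    with gzip.open(fileobj, mode="rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def reads_per_sample(tar_path):
    """Map each *.fq.gz member of a tar archive to its read count."""
    counts = {}
    with tarfile.open(tar_path, mode="r") as archive:
        for member in archive:
            # Skip directories and anything that is not a compressed FASTQ file.
            if not (member.isfile() and member.name.endswith(".fq.gz")):
                continue
            extracted = archive.extractfile(member)
            counts[member.name] = count_fastq_reads(extracted)
    return counts
```

Calling `reads_per_sample("Dros_fastq_af_bbmerge.tar")` on a downloaded copy would return a dictionary keyed by member name. Reading through `extractfile` avoids unpacking the whole archive to disk, which may matter for large genome-skim files.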
Created:
2025-07-29