Supporting data for "ntsm: an alignment-free, ultra low coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection"

Name: Supporting data for "ntsm: an alignment-free, ultra low coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection"
Creator: GigaScience Database
Published: 2025-05-26 16:58:39
License: No description available

Source: DataCite Commons (updated 2025-05-26, indexed 2024-07-13)

Download link:

http://gigadb.org/dataset/102521

Description:

Due to human error, sample swapping remains a common problem in large cohort studies, particularly those with heterogeneous data types (e.g. a mix of Oxford Nanopore, Pacific Biosciences, and Illumina data). At present, all sample swap detection methods require alignment, positional sorting, and indexing of the data in order to compare similarity, steps that are costly and unnecessary when, for example, the data is only used for genome assembly. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. The per-sample error rate and coverage bias (i.e. missing sites) can also be estimated from this information. These estimates can in turn be used to determine whether a spatially indexed PCA-based pre-screening method can be applied, which greatly speeds up analysis by avoiding exhaustive all-to-all comparisons.

Because this tool processes raw data, is faster than alignment, and can be used on very low coverage data, it can save substantial computational resources in standard QC pipelines. It is robust enough to be used on different sequencing data types, which is important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample swap detection, this method provides other information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.
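
To make the idea of a likelihood ratio-based test on variant-site coverage more concrete, the following is a minimal illustrative sketch and not the ntsm implementation. It assumes biallelic sites, per-site k-mer counts given as hypothetical `(ref_count, alt_count)` pairs, a binomial error model, and Hardy-Weinberg genotype priors. The function names, the default allele frequency, and the default error rate are all assumptions chosen for illustration.

```python
import math

# Expected fraction of alt-allele k-mers for genotypes ref/ref, ref/alt, alt/alt.
GENOTYPE_ALT_FRACTION = (0.0, 0.5, 1.0)


def genotype_priors(alt_freq):
    """Hardy-Weinberg priors for (ref/ref, ref/alt, alt/alt); an assumed model."""
    ref_freq = 1.0 - alt_freq
    return (ref_freq * ref_freq, 2.0 * ref_freq * alt_freq, alt_freq * alt_freq)


def count_likelihood(ref, alt, alt_fraction, error_rate):
    """Binomial probability of the observed k-mer counts given one genotype."""
    # Chance that a single observed k-mer supports the alt allele, allowing errors.
    p_alt = alt_fraction * (1.0 - error_rate) + (1.0 - alt_fraction) * error_rate
    return math.comb(ref + alt, alt) * p_alt ** alt * (1.0 - p_alt) ** ref


def log_likelihood_ratio(sample_a, sample_b, alt_freq=0.5, error_rate=0.01):
    """Sum over sites of log P(counts | same individual) / P(counts | unrelated).

    Positive totals favour "same individual", negative totals favour a swap.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must be strictly between 0 and 0.5")
    priors = genotype_priors(alt_freq)
    total = 0.0
    for (ref_a, alt_a), (ref_b, alt_b) in zip(sample_a, sample_b):
        # Sites with no coverage in either sample carry no information.
        if ref_a + alt_a == 0 or ref_b + alt_b == 0:
            continue
        like_a = [count_likelihood(ref_a, alt_a, f, error_rate)
                  for f in GENOTYPE_ALT_FRACTION]
        like_b = [count_likelihood(ref_b, alt_b, f, error_rate)
                  for f in GENOTYPE_ALT_FRACTION]
        # Same individual: both samples share a single genotype.
        same = sum(p * la * lb for p, la, lb in zip(priors, like_a, like_b))
        # Unrelated individuals: genotypes are drawn independently.
        diff = (sum(p * la for p, la in zip(priors, like_a))
                * sum(p * lb for p, lb in zip(priors, like_b)))
        total += math.log(same / diff)
    return total


# Toy low-coverage counts at four sites (made-up numbers for illustration).
sample_a = [(2, 0), (0, 3), (1, 1), (0, 0)]
sample_b = [(3, 0), (0, 2), (2, 1), (1, 0)]  # consistent with sample_a
sample_c = [(0, 2), (3, 0), (0, 2), (2, 0)]  # conflicts with sample_a

print(log_likelihood_ratio(sample_a, sample_b))
print(log_likelihood_ratio(sample_a, sample_c))
```

In this toy setup the score is positive for the consistent pair and negative for the conflicting pair. Because each site contributes only a small amount of evidence at low coverage, summing over many variant sites is what gives a test of this kind its statistical power.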

Provider:

GigaScience Database

Created:

2024-04-12