Supporting data for "A graph clustering algorithm for detection and genotyping of structural variants from long reads"

Name: Supporting data for "A graph clustering algorithm for detection and genotyping of structural variants from long reads"
Creator: GigaScience Database
Published: 2025-05-26 16:58:47
License: No description available

DataCite Commons; updated 2025-05-26; indexed 2024-07-13

Download link:

http://gigadb.org/dataset/102475


Description:

Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs using long-read sequencing data have recently been developed. We present an accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model is used to genotype the SVs precisely based on their supporting evidence. This algorithm is integrated into the single sample variants detector of the Next Generation Sequencing Experience Platform (NGSEP), which facilitates integration with other functionalities for genomics analysis. We performed multiple benchmark experiments, including simulated and real data, representing different genome profiles, sequencing technologies (PacBio HiFi, ONT), and read depths. The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.
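The clustering step described above can be illustrated with a minimal sketch. This is not the NGSEP implementation: it assumes each signature is reduced to a 2-D point (genomic position, SV length) with no rescaling, uses a plain textbook DBSCAN, and summarizes each cluster by the median of its members. The signature values, `eps`, `min_pts`, and the function names `dbscan` and `summarize` are all hypothetical choices made for illustration.

```python
from math import hypot
from statistics import median


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: return a cluster id per point, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points (including i itself) within Euclidean distance eps of i.
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep expanding
    return labels


def summarize(signatures, labels):
    """Collapse each cluster into one candidate SV (median position/length)."""
    members = {}
    for sig, lab in zip(signatures, labels):
        if lab != -1:
            members.setdefault(lab, []).append(sig)
    return [
        {
            "pos": median(p for p, _ in sigs),
            "length": median(n for _, n in sigs),
            "support": len(sigs),
        }
        for _, sigs in sorted(members.items())
    ]


# Hypothetical signatures as (genomic position, SV length) pairs.
signatures = [
    (10_000, 300), (10_005, 310), (9_998, 295), (10_010, 305),
    (50_000, 1200), (50_020, 1190), (49_990, 1210),
    (80_000, 60),  # isolated signature, expected to be treated as noise
]

labels = dbscan(signatures, eps=50, min_pts=3)
calls = summarize(signatures, labels)
print(labels)
for call in calls:
    print(call)
```

In this toy input the two groups of nearby signatures form separate clusters and the isolated signature is left as noise. In practice the two coordinates would likely need scaling so that position and length differences are comparable, and the supporting read count per cluster is the kind of evidence a genotyping model would then consume.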


Provider:

GigaScience Database

Created:

2023-12-08
