
Source code for StrVCTVRE: a supervised learning method to predict the pathogenicity of human genome structural variants

Source: DataCite Commons. Updated 2025-06-01; indexed 2025-06-15.
Download link:
https://datadryad.org/dataset/doi:10.6078/D1GM63
Description:
Whole genome sequencing resolves many clinical cases where standard diagnostic methods have failed. However, at least half of these cases remain unresolved after whole genome sequencing. Structural variants (SVs; genomic variants larger than 50 base pairs) of uncertain significance are the genetic cause of a portion of these unresolved cases. As sequencing methods using long or linked reads become more accessible and SV detection algorithms improve, clinicians and researchers are gaining access to thousands of reliable SVs of unknown disease relevance. Methods to predict the pathogenicity of these SVs are required to realize the full diagnostic potential of long-read sequencing. To address this emerging need, we developed StrVCTVRE to distinguish pathogenic SVs from benign SVs that overlap exons. In a random forest classifier, we integrated features that capture gene importance, coding region, conservation, expression, and exon structure. We found that features such as expression and conservation are important but are absent from SV classification guidelines. We leveraged multiple resources to construct a size-matched training set of rare, putatively benign and pathogenic SVs. StrVCTVRE performs accurately across a wide SV size range on independent test sets, which will allow clinicians and researchers to eliminate about half of SVs from consideration while retaining a 90% sensitivity. We anticipate clinicians and researchers will use StrVCTVRE to prioritize SVs in patients where no SV is immediately compelling, empowering deeper investigation into novel SVs to resolve cases and understand new mechanisms of disease. StrVCTVRE runs rapidly and is available at https://compbio.berkeley.edu/proj/strvctvre/.
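
The trade-off described above (discarding about half of SVs while retaining 90% sensitivity) comes down to choosing a score cutoff. The following is a minimal sketch of that arithmetic, not StrVCTVRE's own code: the scores are made-up illustrative values, the function names are hypothetical, and it assumes that a higher score means "more likely pathogenic".

```python
import math


def threshold_for_sensitivity(pathogenic_scores, target=0.90):
    """Return the highest cutoff that still keeps at least `target`
    of the known pathogenic SVs (those with score >= cutoff)."""
    ranked = sorted(pathogenic_scores, reverse=True)
    # Number of pathogenic SVs that must stay above the cutoff.
    # The small epsilon guards against floating-point round-up.
    keep = math.ceil(target * len(ranked) - 1e-9)
    return ranked[keep - 1]


def fraction_eliminated(scores, cutoff):
    """Fraction of all SVs that fall below the cutoff and are set aside."""
    return sum(s < cutoff for s in scores) / len(scores)


# Hypothetical classifier scores (assumption: NOT real StrVCTVRE output).
pathogenic = [0.95, 0.91, 0.88, 0.84, 0.80, 0.77, 0.72, 0.66, 0.61, 0.30]
benign = [0.70, 0.65, 0.58, 0.52, 0.45, 0.38, 0.27, 0.20, 0.12, 0.05]

cutoff = threshold_for_sensitivity(pathogenic, target=0.90)
sensitivity = sum(s >= cutoff for s in pathogenic) / len(pathogenic)
eliminated = fraction_eliminated(pathogenic + benign, cutoff)

print(f"cutoff={cutoff:.2f} sensitivity={sensitivity:.2f} eliminated={eliminated:.2f}")
```

On this toy data the cutoff keeps 9 of 10 pathogenic SVs and sets aside 9 of the 20 SVs overall. A real analysis would derive the cutoff from a held-out test set rather than from hand-picked values.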

Provider:
Dryad
Created:
2021-08-24