High-Order SNP Combinations Associated with Complex Diseases: Efficient Discovery, Statistical Power and Functional Interactions

Source: NIAID Data Ecosystem (indexed 2026-03-07)

Download link:

https://figshare.com/articles/dataset/High_Order_SNP_Combinations_Associated_with_Complex_Diseases_Efficient_Discovery_Statistical_Power_and_Functional_Interactions/126072


Description:

There has been increased interest in discovering combinations of single-nucleotide polymorphisms (SNPs) that are strongly associated with a phenotype even if each SNP has little individual effect. Efficient approaches have been proposed for searching two-locus combinations from genome-wide datasets. However, for high-order combinations, existing methods either adopt a brute-force search that handles only a small number of SNPs (up to a few hundred), or use a heuristic search that may miss informative combinations. In addition, existing approaches lack statistical power because of the use of statistics with high degrees of freedom and the huge number of hypotheses tested during combinatorial search. Due to these challenges, functional interactions in high-order combinations have not been systematically explored. We leverage discriminative-pattern-mining algorithms from the data-mining community to search for high-order combinations in case-control datasets. The substantially improved efficiency and scalability demonstrated on synthetic and real datasets with several thousand SNPs allow the study of several important mathematical and statistical properties of SNP combinations with order as high as eleven. We further explore functional interactions in high-order combinations and reveal a general connection between the increase in discriminative power of a combination over its subsets and the functional coherence among the genes comprising the combination, supported by multiple datasets. Finally, we study in detail several significant high-order combinations discovered from a lung-cancer dataset and a kidney-transplant-rejection dataset to provide novel insights into these complex diseases. Interestingly, many of these associations involve combinations of common variations that occur in small fractions of the population. Thus, our approach is an alternative methodology for exploring the genetics of rare diseases, for which the current focus is on individually rare variations.
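To give a rough sense of why exhaustive search over high-order combinations does not scale, and why the number of hypotheses erodes statistical power, the sketch below counts the SNP subsets of a given order. It is an illustration only, not the authors' method: the function names are hypothetical, the count ignores genotype configurations within each subset, and the Bonferroni-style per-test threshold is an assumption made here for illustration (the description does not say which correction was used).

```python
from math import comb


def num_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP subsets of size `order` among `n_snps` SNPs."""
    return comb(n_snps, order)


def bonferroni_threshold(n_snps: int, order: int, alpha: float = 0.05) -> float:
    """Per-test significance level if every subset of this order is tested.

    Assumption: a plain Bonferroni correction, used here only to show how
    the hypothesis count shrinks the usable threshold.
    """
    return alpha / num_combinations(n_snps, order)


if __name__ == "__main__":
    # A few hundred SNPs (brute-force scale) versus several thousand.
    for n_snps in (300, 3000):
        for order in (2, 3, 5, 11):
            count = num_combinations(n_snps, order)
            threshold = bonferroni_threshold(n_snps, order)
            print(f"n={n_snps:5d} order={order:2d} "
                  f"subsets={count:.3e} per-test alpha={threshold:.3e}")
```

The count grows roughly as n^k / k!, so moving from pairs to order eleven multiplies the search space by many orders of magnitude even for a few hundred SNPs.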

Created:

2016-01-19
