Supporting data for "VariantSpark: A Distributed Implementation of Random Forest Tailored for Ultra High Dimensional Genomic Data"

Name: Supporting data for "VariantSpark: A Distributed Implementation of Random Forest Tailored for Ultra High Dimensional Genomic Data"
Creator: GigaScience Database
Published: 2025-05-26 17:20:03
License: No description available

Source: DataCite Commons. Updated: 2025-05-26. Indexed: 2025-04-15.

Download link:

http://gigadb.org/dataset/100759

Description:

Many traits and diseases are thought to be driven by more than one gene (polygenic). Polygenic Risk Scores (PRS) therefore expand on Genome-Wide Association Studies (GWAS) by taking multiple genes into account when building risk models. However, PRS consider only the additive effects of individual genes, not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions has been found in small datasets, large datasets have not yet been processed because of the high computational complexity of the search for epistatic interactions. We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to whole-genome, population-scale datasets with a hundred million genomic variants and a hundred thousand samples. Compared to traditional monogenic GWAS, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high dimensional genomic data in a manageable time.
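The distinction between additive and epistatic effects drawn in the description can be illustrated with a toy example. The sketch below is not VariantSpark code and uses none of its API; the cohort, the XOR-style phenotype, and the helper names `marginal_effect` and `joint_accuracy` are all hypothetical, chosen only to show how a phenotype driven purely by an interaction can have no per-variant signal for an additive score to pick up.

```python
from itertools import product

# Hypothetical toy cohort: every combination of two binary variants, 25 samples each.
samples = [(a, b) for a, b in product((0, 1), repeat=2) for _ in range(25)]
# Assumed purely epistatic phenotype: affected only when exactly one variant is present.
phenotype = [a ^ b for a, b in samples]


def marginal_effect(index):
    """Difference in mean phenotype between carriers and non-carriers of one variant."""
    carriers = [y for s, y in zip(samples, phenotype) if s[index] == 1]
    others = [y for s, y in zip(samples, phenotype) if s[index] == 0]
    return sum(carriers) / len(carriers) - sum(others) / len(others)


def joint_accuracy():
    """Accuracy of a rule that predicts the majority phenotype per genotype pair."""
    correct = 0
    for pair in set(samples):
        ys = [y for s, y in zip(samples, phenotype) if s == pair]
        majority = 1 if sum(ys) * 2 >= len(ys) else 0
        correct += sum(1 for y in ys if y == majority)
    return correct / len(samples)


effects = [marginal_effect(0), marginal_effect(1)]
accuracy = joint_accuracy()
print("per-variant effects:", effects)
print("accuracy using both variants jointly:", accuracy)
```

In this constructed case each variant on its own carries no association with the phenotype, while the pair considered together determines it completely. Tree-based methods such as random forests can in principle capture this kind of joint structure because successive splits condition on earlier ones.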

Provider:

GigaScience Database

Created:

2020-06-30
