
Supporting data for "VariantSpark: A Distributed Implementation of Random Forest Tailored for Ultra High Dimensional Genomic Data"

Source: DataCite Commons. Updated 2025-05-26; indexed 2025-04-15.
Download link:
http://gigadb.org/dataset/100759
Resource description:
Many traits and diseases are thought to be driven by more than one gene (polygenic). Polygenic Risk Scores (PRS) therefore extend Genome-Wide Association Studies (GWAS) by taking multiple genes into account when building risk models. However, PRS consider only the additive effects of individual genes, not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions has been found in small datasets, large datasets have not yet been processed because of the high computational complexity of searching for epistatic interactions. We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to population-scale whole-genome datasets with a hundred million genomic variants and a hundred thousand samples. Compared to traditional monogenic GWAS, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high dimensional genomic data in a manageable time.
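The limitation of additive scoring noted in the description can be illustrated with a toy example. The sketch below is not VariantSpark code and does not use this dataset; the two-variant genotypes, the XOR-style phenotype, and the helper names `additive_score` and `marginal_effect` are all assumptions made for illustration. It shows a purely epistatic phenotype in which each variant has zero marginal effect, so an additive score built from per-variant effects cannot separate cases from controls.

```python
from itertools import product

# Hypothetical toy data: two variants, each coded 0 or 1 (not from the dataset).
genotypes = list(product([0, 1], repeat=2))

# Purely epistatic phenotype: affected only when exactly one variant is present.
phenotype = [a ^ b for a, b in genotypes]


def additive_score(genotype, weights):
    """PRS-style score: weighted sum of per-variant allele codes."""
    return sum(w * g for w, g in zip(weights, genotype))


def marginal_effect(variant_index):
    """Mean phenotype difference between carriers and non-carriers of one variant."""
    carriers = [p for g, p in zip(genotypes, phenotype) if g[variant_index] == 1]
    others = [p for g, p in zip(genotypes, phenotype) if g[variant_index] == 0]
    return sum(carriers) / len(carriers) - sum(others) / len(others)


# Per-variant weights estimated one variant at a time, as in a single-variant scan.
weights = [marginal_effect(0), marginal_effect(1)]

# Every individual receives the same additive score, whatever the phenotype.
scores = [additive_score(g, weights) for g in genotypes]

for g, p, s in zip(genotypes, phenotype, scores):
    print(f"genotype={g} phenotype={p} additive_score={s}")
```

Methods that evaluate variants jointly, such as tree-based models that split on one variant and then another, can in principle pick up this kind of interaction, which is the motivation the description gives for a random-forest approach.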
Provider:
GigaScience Database
Created:
2020-06-30