Table 1_Integrating GWAS and machine learning for disease risk prediction in the Taiwanese Hakka population.xlsx

Indexed from NIAID Data Ecosystem on 2026-05-10

Download link:

https://figshare.com/articles/dataset/Table_1_Integrating_GWAS_and_machine_learning_for_disease_risk_prediction_in_the_Taiwanese_Hakka_population_xlsx/30788591

Description:

Introduction: Genome-wide association studies (GWAS) have identified numerous loci associated with complex diseases, yet their predictive power in small or genetically homogeneous populations remains limited. Integrating machine learning with GWAS offers a path to improve risk prediction and uncover functional variants relevant to precision medicine.

Methods: DNA samples from Taiwanese Hakka individuals with type 2 diabetes, hypertension, and eye diseases were analyzed. After standard quality control, 295,589 SNPs were retained. Fourteen machine-learning algorithms were evaluated using SNPs selected through traditional GWAS filtering and refined via wrapper-based feature selection with a best-first search algorithm. Model performance was assessed by internal cross-validation and external validation using Taiwan Biobank data, and functional annotation was conducted through GTEx v10 cis-eQTL analysis.

Results: Predictive models relying solely on significant GWAS SNPs achieved moderate internal accuracy but limited generalizability. Incorporating feature-selected SNPs markedly improved performance: the Random Forest model achieved accuracies above 88% in cross-validation and above 85% in external validation, confirmed by 1,000× bootstrap resampling. eQTL analysis identified functional associations such as rs12121653-KDM5B and rs12121653-MGAT4EP, implicating pathways involved in metabolic and mitochondrial regulation.

Discussion: These findings demonstrate that integrating GWAS with machine-learning-based feature selection enables the construction of robust, population-specific disease risk models. Given the small sample size of the discovery cohort (n = 96), all predictive results should be interpreted as exploratory. We employed stringent cross-validation and 1,000× bootstrap resampling to reduce overfitting, and genomic control metrics (QQ plots and λGC values) were evaluated to ensure no major test statistic inflation. Independent large-scale validation will still be required. The approach effectively captures additive and interaction-driven genetic components and provides a scalable framework for applying precision medicine to underrepresented or isolated populations.
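
The Discussion names two robustness checks: the genomic inflation factor λGC and 1,000× bootstrap resampling of model accuracy. The sketch below shows how such checks are commonly computed. It is not the authors' pipeline. The function names `genomic_inflation` and `bootstrap_accuracy_ci` are hypothetical, the inputs are synthetic, and it assumes two-sided p-values from 1-degree-of-freedom tests and a percentile bootstrap over fixed predictions (the study may have resampled differently, for example by refitting models).

```python
import random
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Estimate lambda_GC from two-sided p-values of 1-df association tests."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    nd = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square statistic via z**2,
    # where z is the standard normal quantile at p / 2.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    # Median of the chi-square(1) null distribution, about 0.455.
    null_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / null_median


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    rng = random.Random(seed)
    n = len(y_true)
    # 1 where the prediction matches the label, 0 otherwise.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    accs = []
    for _ in range(n_boot):
        # Resample individuals with replacement, keeping the sample size.
        resample = rng.choices(correct, k=n)
        accs.append(sum(resample) / n)
    accs.sort()
    lo_idx = max(0, round((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return sum(correct) / n, accs[lo_idx], accs[hi_idx]


if __name__ == "__main__":
    # Synthetic inputs for illustration only; not values from the dataset.
    p_grid = [i / 1000 for i in range(1, 1000)]
    print("lambda_GC:", genomic_inflation(p_grid))

    truth = [1] * 48 + [0] * 48
    preds = truth[:]
    for i in range(0, 96, 8):  # flip 12 of 96 predictions
        preds[i] = 1 - preds[i]
    print("accuracy, lower, upper:", bootstrap_accuracy_ci(truth, preds))
```

A λGC close to 1 indicates that the bulk of test statistics follows the null distribution, while values well above 1 suggest inflation, for example from population stratification. With a cohort as small as n = 96, the bootstrap interval is expected to be wide, which is consistent with the abstract's advice to treat the predictive results as exploratory.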

Created:

2025-12-04