
Model training and validation/test set.

Figshare, updated 2025-06-10, indexed 2026-04-28
Download link:
https://figshare.com/articles/dataset/Model_training_and_validation_test_set_/29283548
Description:
Genome-wide association studies (GWAS) have successfully uncovered numerous associations between genetic variants and disease traits to date. Yet, identifying significantly associated loci remains a considerable challenge due to the concomitant multiple-testing burden of performing such analyses genome-wide. Here, we leverage the genetic associations of molecular traits – DNA CpG-site methylation status and RNA expression – to mitigate this problem. We encode their co-association across the genome using PinSage, a graph convolutional neural network-based recommender system previously deployed at Pinterest. We demonstrate, using this framework, that a model trained only on methylation quantitative trait locus (QTL) data could recapitulate over half (554,209/1,021,052) of possible SNP-RNA associations identified in a large expression QTL meta-analysis. Taking advantage of a recent ‘saturated’ map of height associations, we then show that height-associated loci predicted by a model trained on molecular-QTL data replicated comparably, following Bonferroni correction, to those that were genome-wide significant in UK Biobank (88% compared to 91%). On a set of 64 disease outcomes in UK Biobank, the same model identified 143 independent novel disease associations, with at least one additional association for 64% (41/64) of the disease outcomes examined. Excluding associations involving the MHC region, we achieve a total uplift of over 8% (128/1,548). We successfully replicated 38% (39/103) of the novel disease associations in an independent sample, with suggestive evidence for six additional associations from GWAS Catalog. Replicated associations included, for instance, that between rs10774625 (nearest gene: SH2B3/ATXN2) and coeliac disease and that between rs12350420 (nearest gene: MVB12B) and glaucoma. For many GWAS, attaining such an enhancement by simply increasing sample size may be prohibitively expensive, or impossible depending on disease prevalence.
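
The proportions quoted in the description can be checked from the counts it states. The following is a minimal sketch: the counts are copied from the text above, while the dictionary labels and the `percent` helper are illustrative names that do not come from the dataset or its authors.

```python
# Counts copied from the description above; labels are illustrative only.
reported = {
    "SNP-RNA associations recapitulated": (554_209, 1_021_052),
    "disease outcomes with at least one new association": (41, 64),
    "uplift excluding the MHC region": (128, 1_548),
    "novel associations replicated": (39, 103),
}


def percent(numerator: int, denominator: int) -> float:
    """Return numerator / denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


# Compute each proportion once so it can be printed or compared later.
percentages = {label: percent(n, d) for label, (n, d) in reported.items()}

for label, value in percentages.items():
    print(f"{label}: {value:.1f}%")
```

Each computed value can then be compared with the wording in the description ("over half", "64%", "over 8%", "38%").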

Created: 2025-06-10