Mapping CpG Sites to Genes.

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://figshare.com/articles/dataset/Mapping_CpG_Sites_to_Genes_/23062735

Description:

Integrative approaches that simultaneously model multi-omics data have gained increasing popularity because they provide holistic systems-biology views of multiple or all components in a biological system of interest. Canonical correlation analysis (CCA) is a correlation-based integrative method designed to extract latent features shared between multiple assays by finding the linear combinations of features (referred to as canonical variables, CVs) within each assay that achieve maximal across-assay correlation. Although widely acknowledged as a powerful approach for multi-omics data, CCA has not been systematically applied to multi-omics data in large cohort studies, which have only recently become available. Here, we adapted sparse multiple CCA (SMCCA), a widely used derivative of CCA, to proteomics and methylomics data from the Multi-Ethnic Study of Atherosclerosis (MESA) and Jackson Heart Study (JHS). To tackle challenges encountered when applying SMCCA to MESA and JHS, our adaptations include the incorporation of the Gram-Schmidt (GS) algorithm with SMCCA to improve orthogonality among CVs, and the development of Sparse Supervised Multiple CCA (SSMCCA) to allow supervised integration analysis for more than two assays. Effective application of SMCCA to the two real datasets reveals important findings. Applying our SMCCA-GS to MESA and JHS, we identified strong associations between blood cell counts and protein abundance, suggesting that adjustment for blood cell composition should be considered in protein-based association studies. Importantly, CVs obtained from two independent cohorts also demonstrate transferability across the cohorts. For example, proteomic CVs learned from JHS, when transferred to MESA, explain similar amounts of blood cell count phenotypic variance: 39.0% to 50.0% of the variation in JHS and 38.9% to 49.1% in MESA. Similar transferability was observed for other omics-CV-trait pairs. This suggests that biologically meaningful and cohort-agnostic variation is captured by CVs. We anticipate that applying our SMCCA-GS and SSMCCA to various cohorts would help identify cohort-agnostic, biologically meaningful relationships between multi-omics data and phenotypic traits.
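
To make the two ideas named in the description concrete, here is a minimal NumPy sketch, not the authors' SMCCA-GS implementation. It assumes classical (non-sparse) CCA on two assays and applies Gram-Schmidt to the resulting score vectors after the fact; the description does not say how GS is combined with SMCCA. The function names `first_canonical_pair` and `gram_schmidt`, and the small ridge term `reg`, are hypothetical choices made for illustration.

```python
import numpy as np


def first_canonical_pair(X, Y, reg=1e-8):
    """Return weights (a, b) and the correlation of the first canonical pair.

    Classical two-assay CCA: whiten each assay with the Cholesky factor of
    its covariance, then take the leading singular vectors of the whitened
    cross-covariance. `reg` is an illustrative ridge term for stability.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    Lx = np.linalg.cholesky(Sxx)  # Sxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Syy)  # Syy = Ly @ Ly.T

    # K = Lx^{-1} @ Sxy @ Ly^{-T}, built with solves instead of inverses.
    K = np.linalg.solve(Lx, Sxy)
    K = np.linalg.solve(Ly, K.T).T

    U, s, Vt = np.linalg.svd(K)
    a = np.linalg.solve(Lx.T, U[:, 0])   # map back from whitened space
    b = np.linalg.solve(Ly.T, Vt[0, :])
    return a, b, s[0]


def gram_schmidt(scores, tol=1e-12):
    """Orthonormalise the columns of `scores` in order (modified Gram-Schmidt).

    Each column is one canonical variable evaluated on all samples.
    """
    Q = np.array(scores, dtype=float)
    for j in range(Q.shape[1]):
        for i in range(j):
            # Remove the component of column j along the earlier column i.
            Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]
        norm = np.linalg.norm(Q[:, j])
        if norm < tol:
            raise ValueError(f"column {j} is linearly dependent on earlier columns")
        Q[:, j] /= norm
    return Q
```

With sample-by-feature matrices `X` and `Y`, the canonical variables are the projections `X @ a` and `Y @ b`, and their sample correlation matches the returned singular value up to the ridge term. Stacking several CV score vectors as columns and passing them to `gram_schmidt` yields mutually orthogonal columns, which is the property the GS step is described as improving.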

Created:

2023-05-22
