Genomic region detection via Spatial Convex Clustering

NIAID Data Ecosystem, indexed 2026-03-11

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.h8m74

Description:

Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead scientists seek to discover biologically interpretable genomic regions comprised of contiguous groups of probes which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. We use ideas from fusion penalties and convex clustering to introduce a method for Spatial Convex Clustering, or SpaCC. Our method is specifically tailored to detecting multi-subject regions of methylation, but we also test our approach on the well-studied problem of detecting segments of copy number variation. We formulate our method as a convex optimization problem, develop a massively parallelizable algorithm to find its solution, and introduce automated approaches for handling missing values and determining tuning parameters. Through simulation studies based on real methylation and copy number variation data, we show that SpaCC exhibits significant performance gains relative to existing methods. Finally, we illustrate SpaCC’s advantages as a pre-processing technique that reduces large-scale genomics data into a smaller number of genomic regions through several cancer epigenetics case studies on subtype discovery, network estimation, and epigenetic-wide association.
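The description above names fusion penalties and convex clustering but does not state the objective itself. As a rough illustration only, the sketch below assumes a generic form: minimize 0.5 * ||X - U||_F^2 + lam * sum_j ||U[j+1] - U[j]||_2 over a probes-by-subjects matrix X, with uniform weights on adjacent probes, solved by a plain ADMM loop. The function names, the uniform weights, and the solver are assumptions made for this example and are not taken from the SpaCC paper or its software; missing-value handling and tuning-parameter selection are not covered.

```python
import numpy as np


def spatial_fusion_admm(X, lam, rho=1.0, n_iter=1000):
    """Hypothetical sketch: fuse adjacent probes (rows of X) across all subjects.

    Minimizes 0.5 * ||X - U||_F^2 + lam * sum_j ||U[j+1] - U[j]||_2
    with scaled-form ADMM. Returns the fitted U and the row differences V.
    """
    p, _ = X.shape
    # First-difference operator: row j of D @ U equals U[j+1] - U[j].
    D = np.diff(np.eye(p), axis=0)
    # The U-update solves (I + rho * D^T D) U = X + rho * D^T (V - Z).
    A_inv = np.linalg.inv(np.eye(p) + rho * D.T @ D)
    V = D @ X
    Z = np.zeros_like(V)
    for _ in range(n_iter):
        U = A_inv @ (X + rho * D.T @ (V - Z))
        R = D @ U + Z
        # Row-wise group soft-thresholding: a whole difference row is set
        # to zero when its norm is at most lam / rho.
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        shrink = np.maximum(0.0, 1.0 - (lam / rho) / np.maximum(norms, 1e-12))
        V = shrink * R
        Z = Z + D @ U - V
    return U, V


def regions_from_differences(V, tol=1e-8):
    """Label probes by region: a new region starts at each nonzero difference row."""
    breaks = np.linalg.norm(V, axis=1) > tol
    return np.concatenate([[0], np.cumsum(breaks)])


# Toy data (made up for illustration): 12 probes, 5 subjects, two flat blocks.
rng = np.random.default_rng(0)
X_demo = np.vstack([np.zeros((6, 5)), np.full((6, 5), 3.0)])
X_demo += 0.01 * rng.standard_normal(X_demo.shape)
U_demo, V_demo = spatial_fusion_admm(X_demo, lam=1.0)
print(regions_from_differences(V_demo))
```

Because the penalty acts on whole rows of differences, a boundary between two probes is either shared by every subject or absent for all of them, which is one way to obtain regions common across subjects. Larger values of lam merge more neighbouring probes into fewer regions.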

Created:

2019-08-22
