A machine learning framework to predict cancer metabolomics from gene-expression data

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE299429

Resource description:

Metabolomics provides a direct functional readout of a tumor’s physiology, yet it lags behind other omics technologies in supporting disease monitoring and prognostication. This stems partly from the scarcity of large-scale metabolomic studies and partly from the analytical complexity of detecting diverse metabolites with varying physicochemical properties and concentrations. To address this, we developed a machine learning framework, built on both tumor tissue and cell line samples across multiple cancer types, that predicts metabolomic profiles from gene-expression data. Two different model types were selected and trained for tissues and cell lines, and their generalization capacity was validated on independent cohorts, where they accurately predicted up to 70-80% of tested metabolites. This work offers a scalable and efficient machine learning pipeline for inferring metabolic signatures from transcriptomic signatures, opening avenues to reconstruct and study the metabolic landscape of samples in new and existing datasets that lack direct metabolomics measurements.

This dataset contains RNA-sequencing profiles of MCF7 (PIK3CA wild-type (WT) and E545K mutant (MUT)) and MCF10A (PIK3CA wild-type (WT), E545K and H1047R mutant (MUT)) isogenic cell lines.
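The validation step mentioned in the description (the share of tested metabolites that are accurately predicted on an independent cohort) can be illustrated with a minimal sketch. The description does not state how "accurately predicted" was defined, so the Pearson-correlation criterion, the 0.3 threshold, the function names, and the toy metabolite values below are all assumptions made for illustration only, not the authors' method or data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def fraction_well_predicted(measured, predicted, threshold=0.3):
    """Return (fraction of metabolites with correlation > threshold, per-metabolite scores).

    measured, predicted: dicts mapping a metabolite name to a list of values,
    one value per sample in the held-out cohort. The threshold is an assumed,
    hypothetical cutoff, not one taken from the dataset description.
    """
    scores = {name: pearson(measured[name], predicted[name]) for name in measured}
    hits = [name for name, r in scores.items() if r > threshold]
    return len(hits) / len(scores), scores


# Toy, made-up values for four held-out samples (illustrative only).
measured = {
    "lactate": [1.0, 2.0, 3.0, 4.0],
    "glutamate": [2.0, 1.0, 4.0, 3.0],
    "citrate": [1.0, 3.0, 2.0, 4.0],
}
predicted = {
    "lactate": [1.1, 1.9, 3.2, 3.8],
    "glutamate": [3.0, 4.0, 1.0, 2.0],
    "citrate": [1.0, 2.5, 2.5, 4.0],
}

fraction, scores = fraction_well_predicted(measured, predicted)
print(f"Fraction of metabolites above threshold: {fraction:.2f}")
```

In this toy example two of the three metabolites clear the assumed threshold. In practice the predicted values would come from a model trained on paired gene-expression and metabolomics data, and the cutoff would follow whatever criterion the study itself reports.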

Created:

2025-09-30
