On selecting robust approaches for learning predictive biomarkers in metabolomics datasets

Name: On selecting robust approaches for learning predictive biomarkers in metabolomics datasets
Creator: figshare
Published: 2025-08-21 14:37:22
License: No description available

DataCite Commons | Updated 2025-08-21 | Indexed 2025-09-08

Download link:

https://figshare.com/articles/dataset/On_selecting_robust_approaches_for_learning_predictive_biomarkers_in_metabolomics_datasets/29959079/1

Resource description:

Code repo (data acquisition): https://github.com/thibgo/metabolightsbinarydatasets

Code repo (experiments): https://github.com/thibgo/metabolightsbinarydatasets

Article: https://www.semanticscholar.org/paper/On-Selecting-Robust-Approaches-for-Learning-in-Data-Godon-Plante/cdac02dd43aa79d5ef5240367ca02dec9ba635e4

Abstract: Metabolomics, the study of small molecules within biological systems, offers insights into metabolic processes and, consequently, holds great promise for advancing health outcomes. Biomarker discovery in metabolomics represents a significant challenge, notably due to the high dimensionality of the data. Recent work has addressed this problem by analyzing the most important variables in machine learning models. Unfortunately, this approach relies on prior hypotheses about the structure of the data and may overlook simple patterns. To assess the true usefulness of machine learning methods, we evaluate them on a collection of 835 metabolomics data sets. This effort provides valuable insights for metabolomics researchers regarding where and when to use machine learning. It also establishes a benchmark for the evaluation of future methods. Nonetheless, the results emphasize the high diversity of data sets in metabolomics and the complexity of finding biologically relevant biomarkers. As a result, we propose a novel approach applicable across all data sets, offering guidance for future analyses. This method involves directly comparing univariate and multivariate models. We demonstrate through selected examples how this approach can guide data analysis across diverse data set structures, representative of the observed variability. Code and data are available for research purposes.
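The abstract's core idea, directly comparing a univariate model with a multivariate one on the same data set, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the authors' repositories: the synthetic data (`make_toy_data`), the single-feature threshold rule (`fit_stump`) standing in for a univariate model, and the nearest-centroid classifier (`fit_centroid`) standing in for a multivariate model. Only the Python standard library is used.

```python
import random
from statistics import mean


def make_toy_data(n=120, n_features=6, seed=0):
    """Hypothetical binary data set in which only feature 0 carries signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 2.0 * label  # shift the informative feature for class 1
        X.append(row)
        y.append(label)
    return X, y


def fit_stump(X, y):
    """Univariate stand-in: pick the single feature whose midpoint
    threshold between class means gives the best training accuracy."""
    best = None
    for j in range(len(X[0])):
        m0 = mean(r[j] for r, t in zip(X, y) if t == 0)
        m1 = mean(r[j] for r, t in zip(X, y) if t == 1)
        thr = (m0 + m1) / 2
        sign = 1 if m1 >= m0 else -1
        acc = mean(int((sign * (r[j] - thr) > 0) == bool(t)) for r, t in zip(X, y))
        if best is None or acc > best[0]:
            best = (acc, j, thr, sign)
    _, j, thr, sign = best
    return lambda row: int(sign * (row[j] - thr) > 0)


def fit_centroid(X, y):
    """Multivariate stand-in: nearest class centroid over all features."""
    centroids = {}
    for label in (0, 1):
        rows = [r for r, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def predict(row):
        dist = {k: sum((a - b) ** 2 for a, b in zip(row, c))
                for k, c in centroids.items()}
        return min(dist, key=dist.get)

    return predict


def cv_accuracy(fit, X, y, k=5, seed=0):
    """Mean k-fold accuracy; the model (including feature choice)
    is refit on each training split to avoid leakage."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(k):
        held = set(idx[f::k])
        X_train = [X[i] for i in idx if i not in held]
        y_train = [y[i] for i in idx if i not in held]
        model = fit(X_train, y_train)
        scores.append(mean(int(model(X[i]) == y[i]) for i in held))
    return mean(scores)


X, y = make_toy_data()
uni_acc = cv_accuracy(fit_stump, X, y)
multi_acc = cv_accuracy(fit_centroid, X, y)
print(f"univariate accuracy:   {uni_acc:.3f}")
print(f"multivariate accuracy: {multi_acc:.3f}")
```

If the two scores are close, a single metabolite may already explain the outcome, which is the kind of simple pattern the abstract warns can be overlooked. A clear gap in favour of the multivariate model would instead suggest that the signal is spread across several variables.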

Provider:

figshare

Created:

2025-08-21
