Data_Sheet_2_Machine learning approaches in microbiome research: challenges and best practices.docx

Source: NIAID Data Ecosystem (indexed 2026-05-01)

Download link:

https://figshare.com/articles/dataset/Data_Sheet_2_Machine_learning_approaches_in_microbiome_research_challenges_and_best_practices_docx/24181983


Resource description:

Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation, and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrate that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm, in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression, coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots, can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
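As a rough illustration of the ICE idea mentioned in the description, the sketch below computes one curve per sample by sweeping a single feature over a grid while holding that sample's other features fixed, then scoring each modified sample with a logistic regression model. The weights, bias, samples, and grid are hypothetical values chosen for illustration only; they are not taken from the study or its data sheet, and the helper names `predict_proba` and `ice_curves` are assumptions of this sketch.

```python
import math


def predict_proba(weights, bias, x):
    """Logistic regression probability for a single sample."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def ice_curves(weights, bias, samples, feature_idx, grid):
    """Return one curve per sample: predictions as one feature sweeps the grid."""
    curves = []
    for x in samples:
        curve = []
        for value in grid:
            x_mod = list(x)
            x_mod[feature_idx] = value  # overwrite only the feature of interest
            curve.append(predict_proba(weights, bias, x_mod))
        curves.append(curve)
    return curves


# Hypothetical model and data (illustrative values only, not from the study).
weights = [1.5, -0.8]
bias = -0.2
samples = [[0.1, 0.4], [0.7, -1.0], [-0.3, 2.0]]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]

curves = ice_curves(weights, bias, samples, feature_idx=0, grid=grid)
for x, curve in zip(samples, curves):
    print(x, [round(p, 3) for p in curve])
```

Plotting each curve against the grid gives the ICE plot. Because this toy model has no interaction terms, every curve has the same sigmoid shape and differs only by a shift set by the sample's other features, which is what makes a linear model easy to read from such plots.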


Created:

2023-09-22
