Biased sampling confounds machine learning prediction of antimicrobial resistance

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/zs2mbjv7dn

Description:

Antimicrobial resistance (AMR) poses a growing threat to human health. Increasingly, genome sequencing is being applied for surveillance of bacterial pathogens, producing a wealth of data to train machine learning (ML) applications to predict AMR and identify resistance determinants. However, bacterial populations are highly structured and sampling is biased towards human disease isolates, meaning samples and derived features are not independent. This is rarely considered in applications of ML to AMR. Here, we demonstrate the confounding effects of sample structure by collecting over 24,000 whole genome sequences and AMR phenotypes from five diverse pathogens and constructing realistic pathological training data where resistance is confounded with phylogeny. We show resulting ML models perform poorly, and increasing the training sample size fails to rescue performance. A comprehensive analysis of 6,740 models identifies species- and drug-specific effects on model accuracy. We provide concrete recommendations for evaluating future ML approaches to AMR.
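The description's point that isolates from the same lineage are not independent can be illustrated with a small sketch. It is not the authors' pipeline. It assumes each isolate carries a hypothetical lineage label (for example a sequence type), holds out whole lineages so that no clade appears in both training and test data, and computes a simple measure of how strongly phenotype is tied to lineage. The function names and toy data are invented for illustration.

```python
import random
from collections import defaultdict


def lineage_holdout_split(lineages, test_fraction=0.2, seed=0):
    """Split sample indices so that no lineage appears in both train and test.

    `lineages` holds one hypothetical lineage/clade label per isolate; it is
    an assumption of this sketch, not a field taken from the dataset.
    """
    groups = defaultdict(list)
    for idx, lineage in enumerate(lineages):
        groups[lineage].append(idx)

    names = sorted(groups)
    random.Random(seed).shuffle(names)

    target = test_fraction * len(lineages)
    train, test = [], []
    for name in names:
        # Move whole lineages into the test set until it reaches the target size.
        if len(test) < target:
            test.extend(groups[name])
        else:
            train.extend(groups[name])
    return sorted(train), sorted(test)


def lineage_label_purity(lineages, labels):
    """Fraction of isolates whose label equals the majority label of their lineage.

    Values near 1.0 mean lineage alone almost determines the phenotype, so
    resistance is strongly confounded with population structure.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for lineage, label in zip(lineages, labels):
        counts[lineage][label] += 1
    majority = sum(max(c.values()) for c in counts.values())
    return majority / len(labels)


# Toy data: resistance (1) occurs only in lineage "A".
lineages = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

train_idx, test_idx = lineage_holdout_split(lineages, test_fraction=0.3, seed=1)
purity = lineage_label_purity(lineages, labels)
```

Because lineages are assigned as whole units, the test set can overshoot the requested fraction when a large lineage is drawn. A random per-isolate split would instead place close relatives on both sides and could overstate accuracy when resistance tracks lineage.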

Created:

2025-10-27
