Supplementary Material 8
Source: Figshare. Updated 2025-05-12. Indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Supplementary_Material_8/28601057
Description:
The Synthetic Minority Over-sampling Technique (SMOTE) is a machine learning approach for addressing class imbalance in datasets, and it is useful when identifying antimicrobial resistance (AMR) patterns. In AMR studies, datasets often contain more susceptible isolates than resistant ones, which biases model performance toward the majority class. SMOTE mitigates this by generating synthetic samples of the minority class (resistant isolates) through interpolation between existing minority samples rather than simple duplication, thereby improving model generalization.

When applied to AMR prediction, SMOTE balances the dataset so that classification models are better able to identify resistant Escherichia coli strains and do not overlook rare resistance patterns. It is commonly used with classifiers such as decision trees, support vector machines (SVM), and deep learning models to improve predictive accuracy. By mitigating class imbalance, SMOTE supports more robust AMR detection, aiding early identification of drug-resistant bacteria and informing antibiotic stewardship efforts.

Supervised machine learning is widely used in Escherichia coli genomic analysis to predict antimicrobial resistance, virulence factors, and strain classification. Models are trained on labeled genomic data (e.g., the presence or absence of resistance genes, SNP profiles, or MLST types), and the resulting classifiers identify patterns in those features and make predictions for new isolates.

10 supervised machine learning classifiers for E. coli genome analysis:
1. Logistic regression (LR): A simple yet effective statistical model for binary classification, such as predicting antibiotic resistance or susceptibility in E. coli.
2. Linear support vector machine (Linear SVM): Finds the optimal hyperplane separating E. coli strains based on genomic features such as gene presence or sequence variations.
3. Radial basis function kernel support vector machine (RBF-SVM): A more flexible version of SVM that uses a non-linear kernel to capture complex relationships in genomic data, improving classification accuracy.
4. Extra trees classifier: A tree-based ensemble method that randomly selects features and split thresholds, improving robustness in E. coli strain differentiation.
5. Random forest (RF): An ensemble learning method that constructs multiple decision trees, reducing overfitting and improving prediction accuracy for resistance genes and virulence factors.
6. AdaBoost: A boosting algorithm that combines weak classifiers iteratively, refining predictions and improving the identification of antimicrobial resistance patterns.
7. XGBoost: An optimized gradient boosting algorithm that efficiently handles large genomic datasets, commonly used for high-accuracy predictions in E. coli classification.
8. Naive Bayes (NB): A probabilistic classifier based on Bayes' theorem, suitable for predicting resistance phenotypes from genomic features.
9. Linear discriminant analysis (LDA): A statistical approach that maximizes class separability, helping to distinguish between resistant and susceptible E. coli strains.
10. Quadratic discriminant analysis (QDA): A variation of LDA that allows quadratic (non-linear) decision boundaries, improving classification in datasets with complex genomic structures.

When applied to E. coli genomes, these classifiers help predict antibiotic resistance, track outbreak strains, and understand genomic adaptations. Combining them with feature selection and optimization techniques enhances accuracy, making them valuable tools in bacterial genomics and clinical research.
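
The interpolation step that the description attributes to SMOTE can be illustrated with a minimal sketch. This is not the code used to produce the dataset: the function name `smote`, its parameters, and the toy two-feature isolates are all hypothetical and chosen only for illustration. Each synthetic point is placed at a random position on the line segment between a resistant (minority) sample and one of its nearest resistant neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Illustrative sketch only; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest minority neighbours at random
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two numeric features per isolate
susceptible = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2], [0.1, 0.4]]
resistant = [[0.8, 0.9], [0.9, 0.7]]

# Generate enough synthetic resistant isolates to match the majority class
new_resistant = smote(resistant, n_synthetic=len(susceptible) - len(resistant))
balanced_resistant = resistant + new_resistant
print(len(susceptible), len(balanced_resistant))
```

In practice, oversampling of this kind is usually applied only to the training split, so that synthetic isolates do not leak into the data used for evaluation. Note also that interpolating binary gene presence/absence features yields fractional values, which may call for a variant designed for categorical features.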
Created:
2025-05-12



