Supplementary Material 9

Name: Supplementary Material 9
Creator: figshare
Published: 2025-05-12 08:08:27
License: No description available

DataCite Commons: updated 2025-05-12, indexed 2025-09-08

Download link:

https://figshare.com/articles/dataset/Supplementary_Material_9/28601087

Description:

CD-HIT (Cluster Database at High Identity with Tolerance) is a widely used clustering algorithm that reduces redundancy in large genomic datasets. When applied to machine learning results in Escherichia coli genomic analysis, CD-HIT can group similar sequences or genomic features, improving model efficiency and reducing computational complexity.

CD-HIT in machine learning-based E. coli genomic analysis:

Feature reduction: In supervised machine learning, CD-HIT can cluster similar sequences from genomic data, eliminating redundant information and improving feature selection for classifiers such as Random Forest, XGBoost, and SVM.

Antimicrobial resistance (AMR) analysis: By clustering resistance gene sequences at high identity thresholds (e.g., 90% or 95%), CD-HIT helps identify unique resistance patterns while reducing data redundancy.

Virulence gene clustering: CD-HIT can be applied to virulence factor datasets to cluster homologous genes, aiding the classification of pathogenic E. coli strains.

Improving model generalization: Reducing sequence redundancy helps limit overfitting in machine learning models, leading to better generalization when predicting resistant and susceptible E. coli strains.

Computational efficiency: Clustering sequences with CD-HIT before applying machine learning models reduces the amount of data to process, making large-scale genomic analyses more feasible.

CD-HIT results interpretation:

High identity clustering (e.g., 95-100%): Groups highly similar sequences, ensuring that only distinct genetic variations contribute to machine learning predictions.

Lower identity clustering (e.g., 70-90%): Groups more diverse sequences, which is helpful for broader strain classification or for identifying distant homologs of resistance genes.

CD-HIT results help refine machine learning predictions in E. coli genomic analysis by ensuring that training data remains diverse and balanced.
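
To make the feature-reduction idea above concrete, the following is a minimal Python sketch of how a CD-HIT `.clstr` cluster file could be read so that each sequence maps to one cluster, which can then serve as a single feature. It is not taken from this dataset. The function names `parse_clstr` and `sequence_to_cluster` are hypothetical, and the gene names, lengths, and identity values in `EXAMPLE` are made up for illustration. The sketch assumes the usual `.clstr` layout, where each cluster starts with a `>Cluster N` line and the representative sequence is marked with `*`.

```python
import re

# Member line, e.g. "0\t861nt, >geneA... *" or "1\t861nt, >geneB... at +/99.65%"
_MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(text):
    """Return {cluster_id: {"representative": str or None, "members": [str, ...]}}."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = {"representative": None, "members": []}
            continue
        match = _MEMBER.match(line)
        if match is None or current is None:
            continue  # skip anything that does not look like a member line
        _length, seq_id, status = match.groups()
        clusters[current]["members"].append(seq_id)
        if status == "*":  # CD-HIT marks the cluster representative with "*"
            clusters[current]["representative"] = seq_id
    return clusters


def sequence_to_cluster(clusters):
    """Flatten the parsed clusters into {sequence_id: cluster_id}."""
    return {
        seq_id: cluster_id
        for cluster_id, cluster in clusters.items()
        for seq_id in cluster["members"]
    }


# Illustrative input only: names and identity values are invented.
EXAMPLE = """>Cluster 0
0\t861nt, >blaTEM_1... *
1\t861nt, >blaTEM_135... at +/99.65%
>Cluster 1
0\t1191nt, >tetA... *
"""

if __name__ == "__main__":
    parsed = parse_clstr(EXAMPLE)
    for cluster_id, cluster in parsed.items():
        print(cluster_id, cluster["representative"], len(cluster["members"]))
```

With a mapping like this, per-genome presence or absence could be recorded per cluster rather than per individual sequence, so near-identical variants collapse into one column before a classifier is trained. Whether that is appropriate depends on the identity threshold chosen when CD-HIT was run.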

Provider:

figshare

Created:

2025-05-12
