
Supplementary Material 9

Source: DataCite Commons; updated 2025-05-12; indexed 2025-09-08
Download link:
https://figshare.com/articles/dataset/Supplementary_Material_9/28601087
Description:
CD-HIT (Cluster Database at High Identity with Tolerance) is a widely used sequence clustering algorithm that reduces redundancy in large genomic datasets. In machine learning-based Escherichia coli genomic analysis, CD-HIT can group similar sequences or genomic features, improving model efficiency and reducing computational complexity.

CD-HIT in machine learning-based E. coli genomic analysis:

Feature reduction: In supervised machine learning, CD-HIT can cluster similar sequences from genomic data, eliminating redundant information and improving feature selection for classifiers such as Random Forest, XGBoost, and SVM.
Antimicrobial resistance (AMR) analysis: By clustering resistance gene sequences at high identity thresholds (e.g., 90% or 95%), CD-HIT helps identify unique resistance patterns while reducing data redundancy.
Virulence gene clustering: CD-HIT can be applied to virulence factor datasets to cluster homologous genes, aiding the classification of pathogenic E. coli strains.
Improving model generalization: Reducing sequence redundancy helps limit overfitting in machine learning models, leading to better generalization when predicting resistant and susceptible E. coli strains.
Computational efficiency: Clustering sequences with CD-HIT before applying machine learning models reduces the amount of data to process, making large-scale genomic analyses more feasible.

CD-HIT results interpretation:

High identity clustering (e.g., 95-100%): Groups highly similar sequences, ensuring that only distinct genetic variations contribute to machine learning predictions.
Lower identity clustering (e.g., 70-90%): Groups more diverse sequences, which is helpful for broader strain classification or for identifying distant homologs of resistance genes.

CD-HIT results help refine machine learning predictions in E. coli genomic analysis by keeping the training data diverse and balanced.
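
The feature-reduction idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset authors' pipeline. It assumes the clustering has already been run and that its `.clstr` output is available as text. The sequence names (`blaTEM_a`, `tetA_a`), the genome labels, and the helper functions `parse_clstr` and `cluster_presence` are hypothetical and chosen only for the example. The choice to encode each genome as a presence/absence vector over clusters is likewise an assumption.

```python
import re

# Matches a member line such as "1\t861nt, >seq2... at +/99.65%"
# or a representative line such as "0\t861nt, >seq1... *".
MEMBER_RE = re.compile(r">(.+?)\.\.\.\s+(\*|at\s+\S+)")


def parse_clstr(text):
    """Return {sequence_id: cluster_id} from the text of a .clstr file."""
    seq_to_cluster = {}
    cluster_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            cluster_id = int(line.split()[1])
            continue
        match = MEMBER_RE.search(line)
        if match and cluster_id is not None:
            seq_to_cluster[match.group(1)] = cluster_id
    return seq_to_cluster


def cluster_presence(genome_to_seqs, seq_to_cluster):
    """Collapse per-sequence hits into per-cluster presence/absence rows."""
    n_clusters = max(seq_to_cluster.values()) + 1
    matrix = {}
    for genome, seq_ids in genome_to_seqs.items():
        row = [0] * n_clusters
        for seq_id in seq_ids:
            if seq_id in seq_to_cluster:
                row[seq_to_cluster[seq_id]] = 1
        matrix[genome] = row
    return matrix


# Hypothetical example data: identifiers, lengths, and identities are made up.
EXAMPLE_CLSTR = """>Cluster 0
0\t861nt, >blaTEM_a... *
1\t861nt, >blaTEM_b... at +/99.65%
>Cluster 1
0\t1191nt, >tetA_a... *
"""

seq_to_cluster = parse_clstr(EXAMPLE_CLSTR)
features = cluster_presence(
    {"genome_1": ["blaTEM_a", "tetA_a"], "genome_2": ["blaTEM_b"]},
    seq_to_cluster,
)
print(features)  # {'genome_1': [1, 1], 'genome_2': [1, 0]}
```

Here three sequence-level features collapse into two cluster-level features, because the two near-identical `blaTEM` variants share one column. The resulting rows could then be passed to a classifier such as Random Forest. Whether a 95% or a 70% threshold is appropriate depends on the question being asked, as the interpretation notes describe.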
Provider:
figshare
Created:
2025-05-12