
Supplementary Material 9

Source: Figshare. Updated 2025-05-12. Indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Supplementary_Material_9/28601087
Description:
CD-HIT (Cluster Database at High Identity with Tolerance) is a widely used clustering algorithm that reduces redundancy in large genomic datasets. CD-HIT can group similar sequences or genomic features when applied to machine learning results in Escherichia coli genomic analysis, improving model efficiency and reducing computational complexity.

CD-HIT in machine learning-based E. coli genomic analysis:

Feature reduction: In supervised machine learning, CD-HIT can cluster similar sequences from genomic data, eliminating redundant information and improving feature selection for classifiers like Random Forest, XGBoost, and SVM.

Antimicrobial resistance (AMR) analysis: By clustering resistance gene sequences at high identity thresholds (e.g., 90% or 95%), CD-HIT helps identify unique resistance patterns while reducing data redundancy.

Virulence gene clustering: CD-HIT can be applied to virulence factor datasets to cluster homologous genes, aiding in classifying pathogenic E. coli strains.

Improving model generalization: Reducing sequence redundancy prevents overfitting in machine learning models, leading to better generalization in predicting resistant and susceptible E. coli strains.

Computational efficiency: CD-HIT optimizes data processing by clustering sequences before applying machine learning models, making large-scale genomic analyses more feasible.

CD-HIT results interpretation:

High identity clustering (e.g., 95-100%): Groups highly similar sequences, ensuring that only distinct genetic variations contribute to machine learning predictions.

Lower identity clustering (e.g., 70-90%): Groups more diverse sequences, which is helpful for broader strain classification or identifying distant homologs in resistance genes.

CD-HIT results help refine machine learning predictions in E. coli genomic analysis by ensuring that training data remains diverse and balanced.
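To make the clustering step described above concrete, the following is a minimal sketch of how a CD-HIT `.clstr` output file might be read so that each sequence is mapped to its cluster representative before feature construction. It is not taken from the dataset itself: the function names `parse_clstr` and `member_to_rep`, the sequence identifiers, and the identity value in `SAMPLE` are all hypothetical and chosen only for illustration.

```python
import re

# Member lines look like "1<TAB>286aa, >seq2... at 99.65%";
# the representative of a cluster ends with "*" instead of "at ...%".
_MEMBER = re.compile(r">(\S+?)\.\.\.\s+(\*|at\s+\S+)")


def parse_clstr(text):
    """Return {representative_id: [member_ids]} from CD-HIT .clstr text."""
    clusters = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster header starts a fresh record.
            current = {"rep": None, "members": []}
            clusters.append(current)
            continue
        match = _MEMBER.search(line)
        if match is None or current is None:
            continue  # skip lines that do not look like member entries
        seq_id, flag = match.groups()
        current["members"].append(seq_id)
        if flag == "*":
            current["rep"] = seq_id
    return {c["rep"]: c["members"] for c in clusters if c["rep"] is not None}


def member_to_rep(clusters):
    """Invert the cluster map: every sequence ID points to its representative."""
    return {m: rep for rep, members in clusters.items() for m in members}


# Made-up example input (IDs, lengths, and identity are illustrative only).
SAMPLE = """>Cluster 0
0\t286aa, >seq1... *
1\t286aa, >seq2... at 99.65%
>Cluster 1
0\t291aa, >seq3... *
"""

if __name__ == "__main__":
    clusters = parse_clstr(SAMPLE)
    print(clusters)
    print(member_to_rep(clusters))
```

A file like this would typically come from a run such as `cd-hit -i input.faa -o output -c 0.95 -n 5`, where `-c` sets the identity threshold and the file names are placeholders. Collapsing members onto representatives is one simple way to remove redundant features before training a classifier; other encodings (for example, cluster presence/absence per genome) are equally plausible.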
Created:
2025-05-12