Supplementary Information: CHAPTER 3 - Classification of genomic features of plant-associated bacteria using machine learning

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/10447437

Description:

Appendix A - List of all bacterial genomes used for orthologous gene clustering in the feature extraction step and in the subsequent steps to build and test the classifier models. The list includes the isolation source of each genome and the category assigned to it for genome classification and feature selection.

Appendix B - Distribution of genomes by phylum, family, and genus among the categories defined according to bacterial lifestyle association.

Appendix C - Enriched orthogroups by genus according to each enrichment test (Material and Methods). Values for each test are "Y" (enriched), "N" (not enriched), or "Untested" (clusters were left untested when they had insufficient phylogenetic signal, were too small, or were found in all genomes).

Appendix D - Classification performance of random forest and logistic regression applied to genus-specific datasets of genomic features (orthogroups), using both gene count and presence/absence matrices. Sensitivity: how well the model identifies true positives; Specificity: how well the model avoids false positives; Positive Predictive Value (Pos. Pred. Value): the probability that a positive prediction is correct; Negative Predictive Value (Neg. Pred. Value): the probability that a negative prediction is correct; Precision: the proportion of positive predictions that are correct (equivalent to Pos. Pred. Value); Recall (Sensitivity): the ability to find all relevant cases; F1 Score: the harmonic mean of precision and recall; Prevalence: the proportion of positive cases in the total; Detection Rate: the proportion of all cases that are correctly identified as positive; Detection Prevalence: the proportion of all cases that are predicted positive; Balanced Accuracy: the mean of sensitivity and specificity; Area Under the Curve (AUC): the overall ability of the model to distinguish between positive and negative cases.

Appendix E - Orthogroups with predicted COG assignments that were identified as important features for classifying plant-associated genomes.
COG categories: A - RNA processing and modification; B - Chromatin structure and dynamics; C - Energy production and conversion; D - Cell cycle control, cell division, chromosome partitioning; E - Amino acid transport and metabolism; F - Nucleotide transport and metabolism; G - Carbohydrate transport and metabolism; H - Coenzyme transport and metabolism; I - Lipid transport and metabolism; J - Translation, ribosomal structure and biogenesis; K - Transcription; L - Replication, recombination and repair; M - Cell wall/membrane/envelope biogenesis; N - Cell motility; O - Posttranslational modification, protein turnover, chaperones; P - Inorganic ion transport and metabolism; Q - Secondary metabolites biosynthesis, transport and catabolism; R - General function prediction only; S - Function unknown; T - Signal transduction mechanisms; U - Intracellular trafficking, secretion, and vesicular transport; V - Defense mechanisms; W - Extracellular structures; X - Mobilome: prophages, transposons; Y - Nuclear structure; Z - Cytoskeleton.
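
As an illustration of how the Appendix D metrics relate to one another, the following minimal sketch derives them from the four cells of a binary confusion matrix. It is not the code used to produce the appendix: the function name `classification_metrics` and the example counts are hypothetical, and AUC is omitted because it requires ranked prediction scores rather than a single confusion matrix.

```python
import math


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute confusion-matrix metrics for a binary classifier.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives (hypothetical inputs).
    """
    total = tp + fp + tn + fn
    sensitivity = _ratio(tp, tp + fn)   # also called recall
    specificity = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)           # also called precision
    npv = _ratio(tn, tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": ppv,
        "neg_pred_value": npv,
        "precision": ppv,
        "recall": sensitivity,
        # Harmonic mean of precision and recall.
        "f1": _ratio(2 * ppv * sensitivity, ppv + sensitivity),
        "prevalence": _ratio(tp + fn, total),
        "detection_rate": _ratio(tp, total),
        "detection_prevalence": _ratio(tp + fp, total),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Made-up counts for illustration only; not taken from the dataset.
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because prevalence, detection rate, and detection prevalence all share the total number of cases as their denominator, they are sensitive to class imbalance, which is one reason balanced accuracy is reported alongside them.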

Created:

2023-12-31
