Supplementary Information: CHAPTER 3 - Classification of genomic features of plant-associated bacteria using machine learning

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10447437
Description:
Appendix A - List of all bacterial genomes used for orthologous gene clustering in the feature extraction step and in the subsequent steps to build and test the classifier models. The list includes the isolation source and the related category used for genome classification and feature selection.

Appendix B - Distribution of genomes by phylum, family, and genus among the categories defined according to bacterial lifestyle association.

Appendix C - Enriched orthogroups by genus according to each enrichment test (see Material and Methods). Values for each test are "Y" (enriched), "N" (not enriched), or "Untested" (clusters were left untested when there was insufficient phylogenetic signal, or when they were too small or were found in all genomes).

Appendix D - Classification performance of random forest and logistic regression applied to genus-specific datasets of genomic features (orthogroups), using matrices of both gene counts and presence/absence values. Sensitivity: how well a test identifies true positives; Specificity: how well a test or model avoids false positives; Positive Predictive Value (Pos. Pred. Value): the probability that a positive prediction is correct; Negative Predictive Value (Neg. Pred. Value): the probability that a negative prediction is correct; Precision: the accuracy of positive predictions; Recall (Sensitivity): the ability to find all relevant cases; F1 Score: a combined measure of precision and recall; Prevalence: the proportion of positive cases in the total; Detection Rate: the proportion of all cases that are correctly identified as positive; Detection Prevalence: the proportion of positive predictions; Balanced Accuracy: the average of sensitivity and specificity; Area Under the Curve (AUC): the overall performance of the model in distinguishing between positive and negative cases.

Appendix E - Orthogroups, annotated with predicted COGs, that were important features for classifying plant-associated genomes.
COG categories: A - RNA processing and modification; B - Chromatin structure and dynamics; C - Energy production and conversion; D - Cell cycle control, cell division, chromosome partitioning; E - Amino acid transport and metabolism; F - Nucleotide transport and metabolism; G - Carbohydrate transport and metabolism; H - Coenzyme transport and metabolism; I - Lipid transport and metabolism; J - Translation, ribosomal structure and biogenesis; K - Transcription; L - Replication, recombination and repair; M - Cell wall/membrane/envelope biogenesis; N - Cell motility; O - Posttranslational modification, protein turnover, chaperones; P - Inorganic ion transport and metabolism; Q - Secondary metabolites biosynthesis, transport and catabolism; R - General function prediction only; S - Function unknown; T - Signal transduction mechanisms; U - Intracellular trafficking, secretion, and vesicular transport; V - Defense mechanisms; W - Extracellular structures; X - Mobilome: prophages, transposons; Y - Nuclear structure; Z - Cytoskeleton.
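
The Appendix D metrics defined above all follow from the four cells of a binary confusion matrix. The sketch below is only an illustration of those definitions, not the code used to produce the appendix: the `ConfusionCounts` type, the `classification_metrics` function, and the example counts are hypothetical, and AUC is left out because it needs per-genome prediction scores rather than counts.

```python
from dataclasses import dataclass
from math import nan


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else nan


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the count-based metrics listed for Appendix D."""
    total = c.tp + c.fp + c.tn + c.fn
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # same quantity as recall
    specificity = _ratio(c.tn, c.tn + c.fp)
    ppv = _ratio(c.tp, c.tp + c.fp)          # same quantity as precision
    npv = _ratio(c.tn, c.tn + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": ppv,
        "neg_pred_value": npv,
        "precision": ppv,
        "recall": sensitivity,
        # Harmonic mean of precision and recall.
        "f1": _ratio(2 * ppv * sensitivity, ppv + sensitivity),
        "prevalence": _ratio(c.tp + c.fn, total),
        "detection_rate": _ratio(c.tp, total),
        "detection_prevalence": _ratio(c.tp + c.fp, total),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Made-up counts for illustration only; they are not taken from the dataset.
example = classification_metrics(ConfusionCounts(tp=40, fp=10, tn=45, fn=5))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because precision equals the positive predictive value and recall equals sensitivity, the table in Appendix D reports each of those quantities under two names.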
Created:
2023-12-31