Additional file 1 of Comparative pangenomics: analysis of 12 microbial pathogen pangenomes reveals conserved global structures of genetic and functional diversity

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://figshare.com/articles/dataset/Additional_file_1_of_Comparative_pangenomics_analysis_of_12_microbial_pathogen_pangenomes_reveals_conserved_global_structures_of_genetic_and_functional_diversity/17870487

Description:

Additional file 1:

Figure S1. Phylogenetic tree of selected species and MLST subtype distributions. a) Phylogenetic tree constructed for representative genomes of each species using PATRIC’s Codon Tree service. Genomes are labeled by their name and PATRIC Genome ID. b) Distribution of MLST subtypes for each species’ genome collection. The relative abundances of the top 5 MLST subtypes, all other subtypes, and untyped genomes are shown per species.

Figure S2. Evaluation of the accuracy of Heaps’ Law at predicting pangenome size, with or without controlling for MLST. a) Example fit of Heaps’ Law to the first half of genomes (unbalanced) or MLSTs (MLST balanced) and extrapolation to the second half to evaluate pangenome size projection. b-c) Median mean absolute error (MAE) across Heaps’ Law fits for 100 random genome orderings, with or without MLST balancing, for each species in the fitting and extrapolation regions. Dotted lines indicate equal performance between the two methods.

Figure S3. Gene frequency distributions by species.

Figure S4. Fitted cumulative gene frequency distributions and corresponding core and unique gene frequency thresholds, by species. Observed distributions (solid blue), fitted functions (dashed orange), and the R² and mean absolute error (MAE) of each fit are shown. Fitted inflection points (black dot, dashed gray) and frequency thresholds corresponding to core and unique genes (dashed black) are also shown.

Figure S5. COG functional group enrichment in the core, accessory, and unique genomes of 12 species. Heatmaps are colored by the log2 odds ratio (LOR) between each COG and the a) core, b) accessory, or c) unique genome of each species. COGs are sorted by mean LOR across all species. LOR color scales are symmetric and identical for all plots; four values fall outside the color range: F × E. cloacae (LOR = −7.5), Q × C. coli (LOR = −6.0), and K × C. coli (LOR = −6.9) for accessory genomes; F × E. faecium (LOR = −5.3) for unique genomes. Starred cells correspond to statistically significant enrichments under Fisher’s exact test with FWER < 0.05 under Bonferroni correction (p < 7 × 10⁻⁵, 720 tests).

Figure S6. Top 10 GO terms by enrichment in the core, accessory, and unique genomes of 12 species. Heatmaps are colored by the log2 odds ratio (LOR) between each GO term and the a) core, b) accessory, or c) unique genome of each species. GO terms are sorted by mean LOR across all species. LOR color scales are identical for all plots. Starred cells correspond to statistically significant enrichments under Fisher’s exact test with FWER < 0.05 under Bonferroni correction (p < 3 × 10⁻⁶, 14,904 tests).

Figure S7. Quantile regression between coding allelic entropy and gene length among core genes, by species. Dotted lines show quantile regressions for the 5th and 95th percentiles of coding allelic entropy as a quadratic function of gene length. Red and blue dots are the most diverse and most conserved core genes, respectively, as defined by these regressions.

Figure S8. Rolling window percentiles versus quantile regression between coding allelic entropy and gene length among core genes, by species. Dotted lines show quantile regressions for the 5th and 95th percentiles of coding allelic entropy as a quadratic function of gene length. Orange lines show rolling 5th and 95th percentiles using windows of 50 genes.

Figure S9. Coding allelic entropies of genes used in MLST typing schemes, as percentiles among all core genes of the corresponding species. For A. baumannii, the MLST gene gpi was mapped to two pangenome gene clusters denoted gpi-1 and gpi-2, both of which include gpi variants defined in the A. baumannii MLST typing scheme.

Figure S10. Domains with significant mutation depletion across multiple species. Species-specific mutation depletion for gene-domain pairs with significant multispecies mutation depletion (bootstrap test, FDR < 0.05, Benjamini-Hochberg correction). Domains related to aminoacyl-tRNA synthetases are labeled purple. White cells correspond to domains that could not be annotated within the species’ consensus sequence of the parent protein.

Table S1. Genome counts, abbreviations, and taxon IDs for species examined.

Table S2. Heaps’ Law parameter estimates, fitted by randomly shuffling either all genomes (“by genome”) or one genome per MLST (“by MLST”). Means and standard deviations from 100 iterations are shown for each species, parameter, and method. Species are sorted by Heaps’ Law alpha, estimated using the MLST method.

Table S3. Evaluation of the accuracy of Heaps’ Law fits, based on randomly shuffling either all genomes (“by genome”) or one genome per MLST (“by MLST”). Heaps’ Law was fit to the first half of genomes in pangenome size curves (“fitting region”) generated by either method, and accuracy was evaluated against the second half (“extrapolation region”). The mean absolute error (MAE) for each region was computed; the median MAE across 100 iterations is shown, along with the relative error between the MLST and genome methods. Species are sorted by relative median MAE in the extrapolation region.

Table S4. Gene frequency cutoffs and gene counts for the core, accessory, and unique genomes of 12 species.

Table S5. Correlations between three types of intraspecies sequence diversity for core genes. Variant types are coding (protein sequences), 5′ intergenic (5′ IG, 300 nt upstream and adjacent to the start codon), and 3′ intergenic (3′ IG, 300 nt downstream and adjacent to the stop codon).

Dataset S1. PATRIC genome IDs for all genomes used.

Dataset S2. MLST annotations generated with https://github.com/tseemann/mlst for all genomes.

Dataset S3. Summary of double power function fits to cumulative gene frequency distributions and derived thresholds for classifying genes as core, accessory, or unique. Includes, for each species, the minimum frequency to classify a gene as core, the maximum frequency to classify a gene as unique, the sizes of the core/accessory/unique genomes, the R² and MAE of the fit, and the five fitted parameters.

Dataset S4. Log odds ratios and Fisher’s exact test p-values for enrichment between gene functional groups (COGs, GO terms) and various gene categories (core, accessory, unique, highest sequence diversity, lowest sequence diversity) within each species. Contains raw data for heatmaps and boxplots in Fig. 3c, Fig. 5b, Fig. S5, and Fig. S6.

Dataset S5. Predicted gene names, COG functional categories, and TraDIS E. coli essentiality predictions from Goodall et al. 2018 for the 168 genes observed in the core genome of all 12 species.

Dataset S6. Domain mutation enrichment analysis. For each gene-domain pair, includes the estimated mutation enrichment as domain entropy percentile (species-specific and species-wide averages), bootstrap test p-values, domain InterPro accession IDs, and domain descriptions. Also includes assignment of functional categories to AARS-related domains.
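
The significance cutoffs quoted for Figures S5 and S6 can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code: it assumes the plain Bonferroni rule (target FWER divided by the number of tests) and the standard definition of a log2 odds ratio from a 2×2 contingency table. The function names and the example counts are hypothetical, and how the original analysis handled tables with zero cells is not stated in the description.

```python
import math


def bonferroni_threshold(fwer: float, n_tests: int) -> float:
    """Per-test p-value cutoff that bounds the family-wise error rate by `fwer`."""
    return fwer / n_tests


def log2_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """log2 odds ratio for the 2x2 table [[a, b], [c, d]].

    Illustrative layout (an assumption, not taken from the source):
      a = genes in the functional group and in the gene category
      b = genes in the functional group, not in the category
      c = genes not in the group, in the category
      d = genes in neither
    Raises ZeroDivisionError or ValueError if a cell is zero; the source
    does not say how such tables were treated.
    """
    return math.log2((a * d) / (b * c))


# Test counts as given in the captions for Figures S5 and S6.
cog_cutoff = bonferroni_threshold(0.05, 720)
go_cutoff = bonferroni_threshold(0.05, 14904)
print(f"COG cutoff: {cog_cutoff:.1e}")  # 6.9e-05, i.e. about 7 x 10^-5
print(f"GO cutoff:  {go_cutoff:.1e}")   # 3.4e-06, i.e. about 3 x 10^-6

# Hypothetical counts, for illustration only.
print(log2_odds_ratio(30, 70, 10, 90))
```

A positive LOR indicates that the functional group is over-represented in the gene category and a negative LOR that it is depleted, which matches the symmetric color scale described for Figure S5.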

Created:

2022-01-04