Putative genome contamination has minimal impact on the GTDB taxonomy

Name: Putative genome contamination has minimal impact on the GTDB taxonomy
Creator: The University of Queensland
Published: 2026-01-21 02:22:47
License: No description available

Source: DataCite Commons. Updated: 2026-01-21. Indexed: 2024-07-13.

Download link:

https://espace.library.uq.edu.au/view/UQ:85c83e3

Resource description:

These data are the output of the pipeline hosted at https://github.com/aaronmussig/impact-of-contamination-on-taxonomy

Identifying putatively contaminated genomes

Nucleotide files for the 317,542 genomes that passed quality control in GTDB release 07-RS207 were obtained from the NCBI Assembly database release 207. Genes were called for each genome using Prodigal v2.6.3 with translation table 4 or 11, as specified in the GTDB metadata files (https://data.gtdb.ecogenomic.org/releases/release207/207.0/). Called genes for each genome were processed with GUNC v1.0.5 using the GTDB 05-RS95 and ProGenomes 2.1 DIAMOND reference databases provided with the GUNC software. GUNC results for these two databases were merged by including every genome that failed with either database. If a genome was identified as contaminated with both reference databases, the result with the highest clade separation score (CSS) was used as the worst-case scenario for downstream analyses.

Calculation of contig contamination scores

Since GUNC does not provide a contamination score for individual contigs in a draft genome, we developed a contig-based scoring system for GUNC-failed genomes (Figure 1). First, a taxonomic assignment was determined for each genome by taking the most commonly inferred taxon at each rank across the GUNC-provided assignments for all genes. This genome-specific taxonomic assignment was then used as the reference taxonomy for establishing the classification congruence of each contig. The taxonomic assignment of each contig was determined by taking the majority vote at each rank of the GUNC-provided closest DIAMOND match to each gene on the contig. In the rare case of ties, the tied rank was not considered. The taxonomic assignment was then truncated to the genome-specific rank at which the largest CSS occurs, as identified by GUNC. For each contig, the proportion of genes with an assignment congruent with the genome-specific assignment at this truncated rank was determined.
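The per-rank voting and congruence steps described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: the function names, the dictionary layout for per-gene assignments, and the reading of "congruent" as an exact match at the CSS rank are all assumptions made for the example.

```python
from collections import Counter

# Rank names are illustrative; genes are dicts such as {"domain": "Bacteria", "phylum": "A"}.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def majority_taxonomy(gene_taxa):
    """Return the most common taxon at each rank; tied or empty ranks become None."""
    result = {}
    for rank in RANKS:
        counts = Counter(t[rank] for t in gene_taxa if t.get(rank))
        top = counts.most_common(2)
        if not top:
            result[rank] = None  # no gene carries an assignment at this rank
        elif len(top) == 2 and top[0][1] == top[1][1]:
            result[rank] = None  # tie: the rank is not considered
        else:
            result[rank] = top[0][0]
    return result


def congruent_fraction(gene_taxa, genome_taxonomy, css_rank):
    """Proportion of genes on a contig matching the genome-level taxon at css_rank."""
    if not gene_taxa:
        return 0.0
    reference = genome_taxonomy[css_rank]
    matches = sum(1 for t in gene_taxa if t.get(css_rank) == reference)
    return matches / len(gene_taxa)
```

Here `css_rank` stands for the rank at which GUNC reports the largest CSS for the genome; a contig with a low congruent fraction would be a candidate for a high contamination score.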
The contigs were then ordered by how much the contig-specific taxonomic assignment deviates from the genome-specific taxonomic assignment, from domain down to the rank at which the largest CSS occurs. The greater the deviation, the higher the contamination score (Figure 1).

Average nucleotide identity (ANI)-based analysis of clean genome halves

The clean halves of all failed genomes were assigned to their closest GTDB species representative in release GTDB 07-RS207 using a combination of Mash and FastANI. A reference database comprising the 62,291 bacterial species representatives in GTDB 07-RS207 was first created using Mash v2.3. All species representatives with >80% ANI to each clean half were identified using the Mash parameters k = 16, s = 5000, d = 0.2, and v = 1.0. A second set of species representatives was obtained for each failed genome using its original classification, anticipating the possible movement of clean halves from their original position in the reference tree. This set included all species representatives from the expected genus; if it contained fewer than 100 genomes, species representatives from successively higher ranks, from family up to phylum, were added until at least 100 additional genomes had been included. FastANI v1.3 was run bidirectionally (-q and -r) against the union of the two sets of species representatives obtained for each failed genome to identify the closest match, taking the maximum ANI and alignment fraction (AF) as one result. Self-hits were excluded for failed genomes that were species representatives in GTDB 07-RS207, and the next best match was then considered.
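As a rough sketch of how the two FastANI directions might be combined, the snippet below takes the larger ANI and the larger AF per reference and skips self-hits. The function name and the input layout (dictionaries mapping a reference genome ID to an `(ani, af)` tuple, already parsed from FastANI output) are hypothetical, and ranking hits by ANI first and AF second is an assumption rather than something the description specifies.

```python
def best_hit(forward, reverse, query_id):
    """Merge forward and reverse results and return the closest non-self reference.

    forward, reverse: dicts mapping reference ID -> (ani, af) for each direction.
    Returns (reference_id, (ani, af)) or None when no non-self hit exists.
    """
    merged = {}
    for ref in sorted(set(forward) | set(reverse)):
        if ref == query_id:
            continue  # exclude self-hits for genomes that are species representatives
        f_ani, f_af = forward.get(ref, (0.0, 0.0))
        r_ani, r_af = reverse.get(ref, (0.0, 0.0))
        # Take the maximum ANI and the maximum AF across both directions.
        merged[ref] = (max(f_ani, r_ani), max(f_af, r_af))
    if not merged:
        return None
    closest = max(merged, key=lambda ref: merged[ref])
    return closest, merged[closest]


forward = {"A": (96.0, 0.4), "B": (97.0, 0.6), "Q": (100.0, 1.0)}
reverse = {"A": (98.5, 0.7)}
closest = best_hit(forward, reverse, "Q")
```

In this toy input the self-hit `Q` is ignored and reference `A` is selected because its reverse-direction ANI is the highest.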
Based on the closest FastANI hit, each failed genome was assigned to one of three categories using the species assignment criteria of ≥95% ANI and ≥0.5 AF: i) same species cluster, when the closest hit by ANI was the expected species representative and the species assignment criteria were satisfied; ii) changed species cluster, when a different representative genome was closest and the species assignment criteria were satisfied; and iii) new species cluster, when the closest hit by ANI did not satisfy the species assignment criteria. To estimate the background noise of the ANI analysis, we selected equivalently sized nucleotide subsets of genomes in GTDB 07-RS207 that passed the GUNC analysis. A random ordering of contigs was generated for each passed genome and the first half of the genome was retained for analysis. These datasets were then analyzed in the same way as the clean half dataset. This was repeated 10 times to determine the baseline distribution of taxonomic changes on passed genomes.

Tree-based analysis of clean genome halves

Aligned marker genes from the clean halves of the 4,525 failed genomes that are used as representatives of species clusters in GTDB 07-RS207 were obtained from the GTDB website (https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_reps/bac120_msa_marker_genes_reps_r207.tar.gz). For each genome, the clean set of marker genes was concatenated and aligned to the bac120 multiple sequence alignment (MSA). These 4,525 alignments were then substituted for their original alignments in the GTDB 07-RS207 bacterial species representative MSA comprising 62,291 sequences. The modified MSA was then masked using the standard GTDB filter (https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/bac120_msa_mask_r207.txt), and a maximum likelihood tree was inferred with FastTree v2.1.10 using the WAG model. The tree was bootstrapped 100 times using GenomeTreeTk v0.1.8 and decorated with the GTDB 07-RS207 bacterial taxonomy using PhyloRank v0.1.12.
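The substitution and masking steps in the tree-based analysis can be illustrated with a small sketch. It assumes the MSA has been loaded into a dictionary of genome ID to aligned sequence, and that the mask is a string of `0` and `1` characters with one character per alignment column; both the function names and that mask layout are assumptions for the example, not a description of the pipeline's code.

```python
def substitute_alignments(msa, replacements):
    """Return a copy of the MSA with selected rows replaced by clean-half alignments."""
    updated = dict(msa)
    for genome_id, sequence in replacements.items():
        if genome_id not in updated:
            raise KeyError(f"{genome_id} is not present in the MSA")
        if len(sequence) != len(updated[genome_id]):
            raise ValueError(f"{genome_id}: replacement length differs from the MSA width")
        updated[genome_id] = sequence
    return updated


def apply_mask(msa, mask):
    """Keep only the alignment columns flagged '1' in the mask string."""
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    masked = {}
    for genome_id, sequence in msa.items():
        if len(sequence) != len(mask):
            raise ValueError(f"{genome_id}: sequence length differs from the mask length")
        masked[genome_id] = "".join(sequence[i] for i in keep)
    return masked


msa = {"g1": "ACDE", "g2": "AC-E"}
modified = substitute_alignments(msa, {"g2": "A--E"})
masked = apply_mask(modified, "1011")
```

The original `msa` dictionary is left unchanged, so the unmodified and modified alignments can be compared side by side before tree inference.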
Congruency of GTDB classifications between clean and original failed genomes was scored by comparing the taxonomy string of each genome derived from the inferred tree against its original GTDB 07-RS207 classification. This comparison was performed from phylum to genus, and the highest-ranked incongruent name was recorded. All incongruencies were manually checked by comparison of tree topologies in ARB.

Identification of grossly contaminated genomes

Genomes with contamination from distantly related organisms belonging to different families or higher ranks were identified by processing the contaminated halves of all failed genomes using the ANI- and tree-based analyses described above. The taxonomic strings for the clean and contaminated halves of each genome were compared to each other as described above to estimate the degree of contamination.
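The taxonomy string comparison used for both the congruency scoring and the clean versus contaminated half comparison might look something like the sketch below. It assumes GTDB-style strings with semicolon-separated, rank-prefixed names (`d__`, `p__`, and so on); the function names are hypothetical.

```python
# Ranks compared, from phylum down to genus.
RANK_PREFIXES = ["p__", "c__", "o__", "f__", "g__"]


def parse_taxonomy(tax_string):
    """Map each rank prefix (e.g. 'p__') to its full name in a taxonomy string."""
    parts = (p.strip() for p in tax_string.split(";"))
    return {part[:3]: part for part in parts if part}


def highest_incongruent_rank(original, inferred):
    """Return (prefix, original_name, inferred_name) for the first mismatch, or None."""
    orig = parse_taxonomy(original)
    new = parse_taxonomy(inferred)
    for prefix in RANK_PREFIXES:
        if orig.get(prefix) != new.get(prefix):
            return prefix, orig.get(prefix), new.get(prefix)
    return None


original = ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
            "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli")
inferred = ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
            "o__Enterobacterales;f__Example;g__Example;s__Example sp.")
mismatch = highest_incongruent_rank(original, inferred)
```

Because the ranks are checked from phylum downward, the first mismatch found is the highest-ranked incongruent name; the `f__Example` names are invented placeholders.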

Provider:

The University of Queensland

Created:

2023-12-20