Putative genome contamination has minimal impact on the GTDB taxonomy
Source: Research Data Australia. Indexed: 2024-12-14
Download link:
https://researchdata.edu.au/putative-genome-contamination-gtdb-taxonomy/3368058
Description:
These data are the generated output of the pipeline hosted here: https://github.com/aaronmussig/impact-of-contamination-on-taxonomy

Identifying putatively contaminated genomes

Nucleotide files for the 317,542 genomes that passed quality control in GTDB release 07-RS207 were obtained from the NCBI Assembly database release 207. Genes were called for each genome using Prodigal v2.6.3 with translation table 4 or 11, as specified in the GTDB metadata files (https://data.gtdb.ecogenomic.org/releases/release207/207.0/). Called genes for each genome were processed with GUNC v1.0.5 using the GTDB 05-RS95 and ProGenomes 2.1 DIAMOND reference databases provided with the GUNC software. GUNC results for these two databases were merged by including every genome that failed with either database. If a genome was identified as contaminated with both reference databases, the result with the highest clade separation score (CSS) was used as the worst-case scenario for downstream analyses.

Calculation of contig contamination scores

Since GUNC does not provide a contamination score for individual contigs in a draft genome, we developed a contig-based scoring system for GUNC-failed genomes (Figure 1). First, a taxonomic assignment was determined for each genome by taking the most commonly inferred taxon at each rank across the GUNC-provided assignments for all genes. This genome-specific taxonomic assignment was then used as a reference taxonomy for establishing the classification congruence of each contig. The taxonomic assignment of each contig was determined by taking the majority vote at each rank of the GUNC-provided closest DIAMOND match to each gene on the contig. In the rare case of ties, the tied rank was not considered. The taxonomic assignment was then truncated to the genome-specific rank at which the largest CSS occurs, as identified by GUNC. For each contig, the proportion of genes that had an assignment congruent with the genome-specific assignment at this truncated rank was determined.
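
The per-rank voting and congruence steps described above can be illustrated with a minimal Python sketch. It is not taken from the pipeline repository: the function names, the dictionary layout of the per-gene assignments, and the example taxa are all hypothetical, and it assumes that ties in the genome-level vote are handled the same way as ties in the contig-level vote, and that "congruent" means a gene's taxon at the truncated rank equals the genome's taxon at that rank.

```python
from collections import Counter

# Rank order from domain down to species (names chosen for this sketch).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def majority_taxon(taxa):
    """Return the most common taxon, or None if there is no data or a tie."""
    counts = Counter(t for t in taxa if t is not None)
    if not counts:
        return None
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tied rank is not considered
    return top[0][0]


def consensus_assignment(gene_assignments, max_rank):
    """Vote at each rank from domain down to max_rank (the truncation rank)."""
    depth = RANKS.index(max_rank) + 1
    return {
        rank: majority_taxon(g.get(rank) for g in gene_assignments)
        for rank in RANKS[:depth]
    }


def congruent_proportion(contig_genes, genome_assignment, max_rank):
    """Fraction of a contig's genes that agree with the genome at max_rank."""
    if not contig_genes:
        return 0.0
    reference = genome_assignment[max_rank]
    hits = sum(1 for g in contig_genes if g.get(max_rank) == reference)
    return hits / len(contig_genes)


# Hypothetical per-gene assignments, grouped by contig.
proteo = {"domain": "d__Bacteria", "phylum": "p__Proteobacteria"}
firmi = {"domain": "d__Bacteria", "phylum": "p__Firmicutes"}
contigs = {
    "contig_1": [proteo, proteo, proteo, proteo, firmi],
    "contig_2": [firmi, firmi],
}

# Assumed for this example: the largest CSS occurs at the phylum rank.
css_rank = "phylum"

all_genes = [gene for genes in contigs.values() for gene in genes]
genome = consensus_assignment(all_genes, css_rank)
contig_taxa = {
    name: consensus_assignment(genes, css_rank) for name, genes in contigs.items()
}
scores = {
    name: congruent_proportion(genes, genome, css_rank)
    for name, genes in contigs.items()
}
```

In this toy input the genome-level vote at phylum goes to p__Proteobacteria (4 of 7 genes), so contig_1 is mostly congruent while contig_2 is entirely incongruent and would be the candidate contaminant.
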
The contigs were then ordered by how much the contig-specific taxonomic assignment deviates from the genome-specific taxonomic assignment, from domain to the rank at which the largest CSS occurs. The greater the deviation, the higher the contamination score (Figure 1).

Average nucleotide identity (ANI)-based analysis of clean genome halves

The clean halves of all failed genomes were assigned to their closest GTDB species representative in GTDB release 07-RS207 using a combination of Mash and FastANI. A reference database was first created using Mash v2.3 comprising the 62,291 bacterial species representatives in GTDB 07-RS207. All species representatives with >80% ANI to each cleaned half were identified using the Mash parameters k = 16, s = 5000, d = 0.2, and v = 1.0. A second set of species representatives was obtained for each failed genome using its original classification, anticipating the possible movement of clean halves from their original position in the reference tree. This included all species representatives from the expected genus, and if the list of additional genomes was
Provider:
The University of Queensland



