ProSynTaxDB: Prochlorococcus and Synechococcus Taxonomy Database
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14889680
Resource description:
INTRODUCTION:
Characterizing the distribution of the abundant and closely related picocyanobacteria Prochlorococcus and Synechococcus is essential for understanding marine ecosystems. These organisms are highly diverse, which makes accurate classification of the clusters/clades/grades within each genus challenging. As a result, Prochlorococcus and Synechococcus populations are often characterized as a single strain, concealing the well-documented fine-scale niche partitioning within these groups (Johnson et al., 2006; Hunter-Cevera et al., 2016; Larkin et al., 2016; Kent et al., 2019; Thompson et al., 2021; Ustick et al., 2023). Here, we introduce ProSynTaxDB and its associated workflow, designed to significantly improve metagenomic classification of Prochlorococcus and Synechococcus using a substantial amount of high-quality genomic reference data collected over the past decade and in this study. ProSynTaxDB includes 1,260 single-cell amplified genomes, high-quality draft cultured genomes, and unpublished closed genomes, featuring new closed circular assemblies for 40 Prochlorococcus, 12 Synechococcus, and 10 marine heterotrophic bacterial strains. These include 21 Prochlorococcus genomes that were previously partially assembled (Biller et al., 2014) and 16 genomes from unpublished isolates. Additionally, the database includes 27,799 genomes of marine heterotrophic bacteria, archaea, and viruses to assess the communities surrounding Prochlorococcus and Synechococcus. ProSynTaxDB and the accompanying workflow can accurately identify clades in metagenomic samples containing at least 0.60% Prochlorococcus reads or 0.09% Synechococcus reads, thereby improving our understanding of these picocyanobacteria in low-abundance regions.
GitHub repository for the associated workflow: https://github.com/jamesm224/ProSynTaxDB-workflow
FILE DESCRIPTION:
ProSynTaxDB_genomes.tsv
Table of genomes included in the ProSynTaxDB and their associated metadata (Data Citation 1). Data fields are as follows:
organism: The name of the organism recorded in NCBI when available. For genomes/organisms obtained from sources other than NCBI, the organism name is provided in NCBI format
genome_short_name: The genome name used in the ProSynTaxDB
domain: Bacteria, Archaea, Eukarya, or Virus
genus: The genus of the organism in NCBI
clade: The major cluster/clade/grade of Prochlorococcus or Synechococcus based on phylogenetic reconstruction using a concatenated alignment of proteins encoded by single-copy core genes
NCBI_BioProject: The NCBI BioProject accession number associated with the organism, when available
NCBI_BioSample: The NCBI BioSample accession number associated with the organism, when available
NCBI_GenBank: The NCBI GenBank accession number associated with the genome sequence data, when available
IMG_Genome_ID: The IMG Genome ID accession number, when available, associated with the genome/organism in the Joint Genome Institute’s (JGI) Integrated Microbial Genomes (IMG) repository. The IMG Genome ID is synonymous with the IMG Taxon ID
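As a rough illustration of how the genome table might be summarized, the sketch below counts genomes per clade. It assumes the column headers in ProSynTaxDB_genomes.tsv match the field names listed above exactly and that genomes without a clade have an empty value in that column; neither has been checked against the file. The function name `count_genomes_by_clade` and the example rows are hypothetical.

```python
import csv
from collections import Counter


def count_genomes_by_clade(lines, genus=None):
    """Count genomes per clade from tab-delimited lines with a header row.

    Assumes columns named "genus" and "clade" (an assumption based on the
    field list in this description, not verified against the file).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    for row in reader:
        if genus is not None and row["genus"] != genus:
            continue
        # Genomes outside the two focal genera may have no clade value.
        counts[row["clade"] or "unassigned"] += 1
    return counts


# Made-up rows for illustration only; these are not real database entries.
example = [
    "organism\tgenome_short_name\tdomain\tgenus\tclade\n",
    "Example organism 1\tex1\tBacteria\tProchlorococcus\tclade_X\n",
    "Example organism 2\tex2\tBacteria\tProchlorococcus\tclade_X\n",
    "Example organism 3\tex3\tBacteria\tSynechococcus\tclade_Y\n",
    "Example organism 4\tex4\tBacteria\tExamplebacter\t\n",
]

print(count_genomes_by_clade(example, genus="Prochlorococcus"))
```

With the real file, the same function could be called on an open handle, for example `with open("ProSynTaxDB_genomes.tsv", newline="") as fh: count_genomes_by_clade(fh)`.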
ProSynTaxDB_names.dmp
Names taxonomy file for use with the ProSynTaxDB.
ProSynTaxDB_nodes.dmp
Nodes taxonomy file for use with the ProSynTaxDB.
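The two taxonomy files can be read together to recover the lineage of a taxon. The sketch below assumes both files follow the NCBI taxonomy dump layout (pipe-separated fields; taxon ID, parent ID, and rank in the nodes file; taxon ID, name, unique name, and name class in the names file). That layout is an assumption here, and the helper names and example IDs are hypothetical.

```python
def parse_dmp_line(line):
    """Split one NCBI-style dump row into stripped fields."""
    # Rows look like "1\t|\t1\t|\tno rank\t|"; drop the trailing "|" first.
    return [field.strip() for field in line.rstrip().rstrip("|").split("|")]


def load_parents(nodes_lines):
    """Map each taxon ID to its parent taxon ID."""
    parents = {}
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            parents[fields[0]] = fields[1]
    return parents


def load_names(names_lines):
    """Map each taxon ID to a name, preferring the scientific name."""
    names = {}
    for line in names_lines:
        fields = parse_dmp_line(line)
        if len(fields) < 2:
            continue
        tax_id, name = fields[0], fields[1]
        is_scientific = len(fields) >= 4 and fields[3] == "scientific name"
        if is_scientific or tax_id not in names:
            names[tax_id] = name
    return names


def lineage(tax_id, parents, names):
    """Return names from the root down to tax_id."""
    path, seen = [], set()
    # The root is its own parent in NCBI-style dumps; `seen` stops the loop.
    while tax_id in parents and tax_id not in seen:
        seen.add(tax_id)
        path.append(names.get(tax_id, tax_id))
        tax_id = parents[tax_id]
    return path[::-1]


# Made-up taxonomy for illustration only.
nodes_example = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tgenus\t|\n",
    "11\t|\t10\t|\tclade\t|\n",
]
names_example = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "10\t|\tExample genus\t|\t\t|\tscientific name\t|\n",
    "11\t|\tExample clade\t|\t\t|\tscientific name\t|\n",
]

parents = load_parents(nodes_example)
names = load_names(names_example)
print(lineage("11", parents, names))
```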
ProSynTaxDB.fmi
Index file containing contents of ProSynTaxDB_v1.faa for use with ProSynTaxDB.
CyCOG6.dmnd
Database containing orthologous groups of proteins used in the cluster/clade/grade normalization step.
ProSynTaxDB.faa
File containing the protein sequences used by Kaiju for classification of reads. Each protein sequence is preceded by a header line starting with “>”.
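To show how the nodes file and the index fit together in a Kaiju run, the sketch below only assembles a command line and does not execute it. It uses Kaiju's standard options (`-t` for the nodes file, `-f` for the index, `-i` and `-j` for reads, `-o` for output, `-z` for threads). The read and output file names are hypothetical, and the parameters actually used by the ProSynTaxDB workflow may differ; consult the workflow repository for those.

```python
import shlex


def kaiju_command(nodes_dmp, fmi, reads_1, output, reads_2=None, threads=1):
    """Build (but do not run) a Kaiju classification command."""
    cmd = [
        "kaiju",
        "-t", nodes_dmp,   # taxonomy nodes file
        "-f", fmi,         # index file
        "-i", reads_1,     # single-end or forward reads
        "-o", output,      # per-read classification output
        "-z", str(threads),
    ]
    if reads_2 is not None:
        cmd += ["-j", reads_2]  # reverse reads for paired-end input
    return cmd


# Hypothetical sample file names, for illustration only.
cmd = kaiju_command(
    "ProSynTaxDB_nodes.dmp",
    "ProSynTaxDB.fmi",
    "sample_R1.fastq",
    "sample.kaiju.out",
    reads_2="sample_R2.fastq",
    threads=4,
)
print(shlex.join(cmd))
```

Returning the argument list rather than a shell string means it can be passed directly to `subprocess.run(cmd, check=True)` without quoting issues.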
average_cycog_length.csv
Comma-separated file containing the average length of each protein sequence used in the normalization step. Data fields are as follows:
cycog: name of the single-copy core gene
mean_AA_length: the average length, in amino acids, of the protein sequence of the gene
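The description does not spell out the normalization itself, so the following is only a generic sketch of length normalization, not the workflow's documented method. It assumes the CSV header uses the two field names above, divides the number of reads hitting each gene by that gene's mean length, and averages across genes. The function names, gene names, and counts are hypothetical.

```python
import csv
from statistics import mean


def load_mean_lengths(lines):
    """Read cycog -> mean amino-acid length from CSV lines with a header."""
    reader = csv.DictReader(lines)
    return {row["cycog"]: float(row["mean_AA_length"]) for row in reader}


def length_normalized_abundance(hit_counts, mean_lengths):
    """Average of per-gene read counts divided by mean gene length.

    This is an illustrative normalization, assumed for this sketch; it is
    not taken from the workflow's source code.
    """
    per_gene = [
        hit_counts[gene] / mean_lengths[gene]
        for gene in hit_counts
        if gene in mean_lengths  # skip genes without a recorded length
    ]
    return mean(per_gene) if per_gene else 0.0


# Made-up lengths and read counts for illustration only.
lengths_example = [
    "cycog,mean_AA_length\n",
    "example_gene_A,100\n",
    "example_gene_B,200\n",
]
mean_lengths = load_mean_lengths(lengths_example)
hits = {"example_gene_A": 50, "example_gene_B": 50}
print(length_normalized_abundance(hits, mean_lengths))
```

Dividing by length keeps long genes, which attract more reads simply because they are longer, from dominating the per-clade estimate.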
ProSynTaxDB-workflow_benchmarking_genomes.tsv
This tab-delimited file lists the genome subsets used in each benchmarking experiment described in the Technical Validation section. Data fields are as follows:
Experiment Name: name of benchmarking experiment conducted
Subset ID: unique ID from the random genome subsetting
Genome Name: name of genome used in benchmarking experiment
ProSynTaxDB-workflow_benchmarking_composition.tsv
This tab-delimited file contains the taxon composition of all samples used in each benchmarking experiment described in the Technical Validation section. Data fields are as follows:
Experiment Name: name of benchmarking experiment conducted
Sample Name: unique sample name
Percent Prochlorococcus: percent of reads in simulated sample originating from Prochlorococcus genomes
Percent Synechococcus: percent of reads in simulated sample originating from Synechococcus genomes
Percent Heterotroph: percent of reads in simulated sample originating from marine heterotrophic bacterial genomes
Notes: additional information about the simulated sample
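The composition table can be compared against the detection limits quoted in the introduction (at least 0.60% Prochlorococcus reads or 0.09% Synechococcus reads). The sketch below flags, per simulated sample, which genus meets its limit. It assumes the column headers match the field names above and that the percent columns hold plain numbers; the function name and example rows are hypothetical.

```python
import csv

# Minimum read percentages for accurate clade identification,
# as stated in the introduction of this description.
PRO_MIN_PERCENT = 0.60
SYN_MIN_PERCENT = 0.09


def flag_classifiable(lines):
    """Flag, per sample, whether each genus meets its stated read threshold."""
    reader = csv.DictReader(lines, delimiter="\t")
    flags = {}
    for row in reader:
        pro = float(row["Percent Prochlorococcus"])
        syn = float(row["Percent Synechococcus"])
        flags[row["Sample Name"]] = {
            "Prochlorococcus": pro >= PRO_MIN_PERCENT,
            "Synechococcus": syn >= SYN_MIN_PERCENT,
        }
    return flags


# Made-up samples for illustration only.
composition_example = [
    "Experiment Name\tSample Name\tPercent Prochlorococcus\t"
    "Percent Synechococcus\tPercent Heterotroph\tNotes\n",
    "example_experiment\tsample_1\t1.0\t0.05\t98.95\t\n",
    "example_experiment\tsample_2\t0.1\t0.5\t99.4\t\n",
]

for sample, result in flag_classifiable(composition_example).items():
    print(sample, result)
```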
Created:
2025-03-18



