ProSynTaxDB: Prochlorococcus and Synechococcus Taxonomy Database

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14889680

Resource description:

INTRODUCTION:

Understanding the distribution of the abundant and closely related picocyanobacteria Prochlorococcus and Synechococcus is essential for understanding marine ecosystems. These organisms are highly diverse, making the accurate classification of clusters/clades/grades within each genus challenging. As a result, Prochlorococcus and Synechococcus populations are often characterized as a single strain, concealing the well-documented fine-scale niche partitioning within these groups (Johnson et al., 2006; Hunter-Cevera et al., 2016; Larkin et al., 2016; Kent et al., 2019; Thompson et al., 2021; Ustick et al., 2023).

Here, we introduce ProSynTaxDB and its associated workflow, designed to significantly improve metagenomic classification of Prochlorococcus and Synechococcus using a substantial amount of high-quality genomic reference data collected over the past decade and from this study. ProSynTaxDB includes 1,260 single-cell amplified genomes, high-quality draft cultured genomes, and unpublished closed genomes, featuring new closed circular assemblies for 40 Prochlorococcus, 12 Synechococcus, and 10 marine heterotrophic bacterial strains. This includes 21 Prochlorococcus genomes that were previously partially assembled (Biller et al., 2014) and 16 genomes from unpublished isolates. Additionally, the database includes 27,799 genomes of marine heterotrophic bacteria, archaea, and viruses to assess communities surrounding Prochlorococcus and Synechococcus. ProSynTaxDB and the accompanying workflow can accurately identify clades in metagenomic samples containing at least 0.60% Prochlorococcus reads or 0.09% Synechococcus reads, thereby improving our understanding of these picocyanobacteria in low-abundance regions.

GitHub repository for the associated workflow: https://github.com/jamesm224/ProSynTaxDB-workflow

FILE DESCRIPTION:

ProSynTaxDB_genomes.tsv
Table of genomes included in the ProSynTaxDB and their associated metadata (Data Citation 1).
Data fields are as follows:
  organism: The name of the organism recorded in NCBI when available. For genomes/organisms obtained from sources other than NCBI, the organism name is provided in NCBI format.
  genome_short_name: The genome name used in the ProSynTaxDB.
  domain: Bacteria, Archaea, Eukarya, or Virus.
  genus: The genus of the organism in NCBI.
  clade: The major cluster/clade/grade of Prochlorococcus or Synechococcus based on phylogenetic reconstruction using a concatenated alignment of proteins encoded by single-copy core genes.
  NCBI_BioProject: The NCBI BioProject accession number associated with the organism, when available.
  NCBI_BioSample: The NCBI BioSample accession number associated with the organism, when available.
  NCBI_GenBank: The NCBI GenBank accession number associated with the genome sequence data, when available.
  IMG_Genome_ID: The IMG Genome ID accession number, when available, associated with the genome/organism in the Joint Genome Institute’s (JGI) Integrated Microbial Genomes (IMG) repository. The IMG Genome ID is synonymous with the IMG Taxon ID.

ProSynTaxDB_names.dmp
Names taxonomy file for use with the ProSynTaxDB.

ProSynTaxDB_nodes.dmp
Nodes taxonomy file for use with the ProSynTaxDB.

ProSynTaxDB.fmi
Index file containing contents of ProSynTaxDB_v1.faa for use with ProSynTaxDB.

CyCOG6.dmnd
Database containing orthologous groups of proteins used in the cluster/clade/grade normalization step.

ProSynTaxDB.faa
File containing protein sequences used by Kaiju for classification of reads. Each protein sequence contains a header starting with “>”.

average_cycog_length.csv
Comma-separated file containing the average length for each protein sequence used in the normalization step.
Data fields are as follows:
  cycog: Name of the single-copy core gene.
  mean_AA_length: The average length, in amino acids, of the protein sequence of the gene.

ProSynTaxDB-workflow_benchmarking_genomes.tsv
This tab-delimited file lists the genome subsets used in each benchmarking experiment in the Technical Validation section.
Data fields are as follows:
  Experiment Name: Name of the benchmarking experiment conducted.
  Subset ID: Unique ID from the random genome subsetting.
  Genome Name: Name of the genome used in the benchmarking experiment.

ProSynTaxDB-workflow_benchmarking_composition.tsv
This tab-delimited file contains the taxon composition of all samples used in each benchmarking experiment in the Technical Validation section.
Data fields are as follows:
  Experiment Name: Name of the benchmarking experiment conducted.
  Sample Name: Unique sample name.
  Percent Prochlorococcus: Percent of reads in the simulated sample originating from Prochlorococcus genomes.
  Percent Synechococcus: Percent of reads in the simulated sample originating from Synechococcus genomes.
  Percent Heterotroph: Percent of reads in the simulated sample originating from marine heterotrophic bacterial genomes.
  Notes: Additional information about the simulated sample.
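
As an illustration of how the tab-delimited tables described above might be read, the following is a minimal Python sketch, not part of the ProSynTaxDB workflow. It assumes that the column headers in the files match the field names listed above exactly, and that genomes outside Prochlorococcus and Synechococcus have an empty clade field; both are assumptions that should be checked against the real files. The function names and the example rows are hypothetical and made up for this sketch. The 0.60% and 0.09% thresholds are the ones quoted in the introduction.

```python
import csv
import io
from collections import Counter

# Minimum percent of reads quoted in the introduction for clade identification.
MIN_PRO_PERCENT = 0.60
MIN_SYN_PERCENT = 0.09


def count_clades(tsv_text):
    """Count genomes per (genus, clade) in a ProSynTaxDB_genomes.tsv-style table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        clade = row["clade"].strip()
        if clade:  # assumption: genomes without a clade leave this field empty
            counts[(row["genus"], clade)] += 1
    return counts


def classifiable_genera(tsv_text):
    """Map each sample name to the genera whose read percentage meets the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        genera = []
        if float(row["Percent Prochlorococcus"]) >= MIN_PRO_PERCENT:
            genera.append("Prochlorococcus")
        if float(row["Percent Synechococcus"]) >= MIN_SYN_PERCENT:
            genera.append("Synechococcus")
        result[row["Sample Name"]] = genera
    return result


# Made-up rows for illustration only; they are not taken from the real files.
GENOMES_EXAMPLE = "\n".join([
    "genome_short_name\tdomain\tgenus\tclade",
    "genome_1\tBacteria\tProchlorococcus\tclade_A",
    "genome_2\tBacteria\tProchlorococcus\tclade_A",
    "genome_3\tBacteria\tSynechococcus\tclade_B",
    "genome_4\tBacteria\tExampleHeterotroph\t",
])

COMPOSITION_EXAMPLE = "\n".join([
    "Experiment Name\tSample Name\tPercent Prochlorococcus\t"
    "Percent Synechococcus\tPercent Heterotroph",
    "exp_1\tsample_1\t0.60\t0.05\t99.35",
    "exp_1\tsample_2\t0.10\t0.09\t99.81",
    "exp_1\tsample_3\t0.00\t0.00\t100.00",
])

print(count_clades(GENOMES_EXAMPLE))
print(classifiable_genera(COMPOSITION_EXAMPLE))
```

To use this with the downloaded files, the example strings could be replaced by the text of ProSynTaxDB_genomes.tsv and ProSynTaxDB-workflow_benchmarking_composition.tsv read from disk.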

Created:

2025-03-18
