SBDI Sativa curated 16S GTDB database
Source: DataCite Commons; updated 2025-05-07; indexed 2025-04-16
Download link:
https://figshare.scilifelab.se/articles/dataset/SBDI_Sativa_curated_16S_GTDB_database/14869077/8
Description:
The data in this [repository](https://doi.org/10.17044/scilifelab.14869077) is the result of vetting 16S sequences from the Genome Taxonomy Database (GTDB) release R09-RS220 (r220) (https://gtdb.ecogenomic.org/; Parks et al. 2018) with the Sativa program (Kozlov et al. 2016) using the [sbdi-phylomarkercheck](https://github.com/biodiversitydata-se/sbdi-phylomarkercheck) Nextflow pipeline.

Sativa (Kozlov et al. 2016) was used to check that the phylogenetic signal of each GTDB 16S sequence is consistent with its taxonomy. Before calling Sativa, sequences longer than 2000 nucleotides or containing Ns were removed, and the reverse complement of each sequence was calculated. The sequences were then aligned with HMMER (Eddy 2011) using the Barrnap (https://github.com/tseemann/barrnap) archaeal and bacterial 16S profiles respectively, and sequences containing more than 10% gaps were removed. The remaining sequences were analyzed with Sativa, and those that were not phylogenetically consistent with their taxonomy were removed.

Files for the DADA2 (Callahan et al. 2016) methods `assignTaxonomy` and `addSpecies` are available, in three different versions each. The `assignTaxonomy` files contain taxonomy for domain, phylum, class, order, family, genus and species. (Note that it has been proposed that species assignment for short 16S sequences requires 100% identity (Edgar 2018), so use species assignments from `assignTaxonomy` with caution.) The versions differ in the maximum number of genomes included per species: 1, 5 or 20, indicated by "1genome", "5genomes" and "20genomes" in the file names respectively. Using the version with 20 genomes per species should increase the chance that the `addSpecies` algorithm finds an exactly matching sequence, whereas a file with many genomes per species could bias the taxonomic annotations that `assignTaxonomy` makes at higher levels. We therefore recommend the "1genome" files for `assignTaxonomy` and the "20genomes" files for `addSpecies`.

All files are gzipped fasta files with 16S sequences. The `assignTaxonomy` files pair each sequence with its taxonomy hierarchy from domain to species, whereas the `addSpecies` files have sequence identifiers and species names. There is also a fasta file with the original GTDB sequence names: `sbdi-gtdb-sativa.r09rs220.20genomes.fna.gz`.

Taxonomic annotation of 16S amplicons using this data is available as an optional argument to the nf-core/ampliseq Nextflow workflow from version 2.1: `--dada_ref_taxonomy sbdi-gtdb` (https://nf-co.re/ampliseq; Straub et al. 2020).

In addition to the fasta files, the workflow outputs phylogenetic trees by optimizing the branch lengths of the original phylogenomic GTDB trees based on a 16S sequence alignment. As not all species in GTDB have correct 16S sequences, the GTDB trees are first subset to contain only species for which the species representative genome has a correct 16S sequence. Branch lengths for the tree are then optimized based on the original alignment of 16S sequences using IQ-TREE (Nguyen et al. 2015) with a GTR+F+I+G4 model. The alignment files end with `.alnfna`, the taxonomy files with `.taxonomy.tsv` and the tree files (newick-formatted) with `.brlenopt.newick`. They will be made available in nf-core/ampliseq for phylogenetic placement.

The data will be updated roughly once a year, after the GTDB database is updated.

Version history:
v8 (2025-02-18): Remove extra sequences from e.g. "1genome" files that appeared due to ties.
v7 (2024-06-25): Update to GTDB R09-RS220 from R08-RS214.
v6 (2024-04-24): Replace manual procedure with Nextflow pipeline. Update to GTDB R08-RS214 from R07-RS207.
v5 (2022-10-07): Add missing fasta file with original GTDB names.
v4 (2022-08-31): Update to GTDB R07-RS207 from R06-RS202.

Acknowledgements:
The computations were enabled by resources in projects NAISS 2023/22-601, SNIC 2022/22-500 and SNIC 2021/22-263 provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at UPPMAX, funded by the Swedish Research Council through grant agreement no. 2022-06725. Computations were also enabled by resources provided by Dr. Maria Vila-Costa, Institute of Environmental Assessment and Water Research (IDAEA-CSIC), Barcelona.
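The sequence filters that the description applies before the Sativa run (length, Ns, reverse complement, gap fraction after alignment) can be illustrated with a small Python sketch. This is not code from sbdi-phylomarkercheck: the function names are hypothetical, and treating both `-` and `.` as gap characters is an assumption about how the HMMER alignment is represented.

```python
# Illustrative sketch only; names and gap characters are assumptions,
# not taken from the sbdi-phylomarkercheck pipeline.
MAX_LENGTH = 2000         # sequences longer than this are removed
MAX_GAP_FRACTION = 0.10   # aligned sequences with more gaps than this are removed

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def passes_prefilter(seq: str) -> bool:
    """True if the raw sequence is at most MAX_LENGTH nt and contains no N."""
    return len(seq) <= MAX_LENGTH and "N" not in seq.upper()


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence made of A, C, G and T."""
    return seq.translate(_COMPLEMENT)[::-1]


def gap_fraction(aligned: str, gap_chars: str = "-.") -> float:
    """Fraction of positions in an aligned sequence that are gaps."""
    if not aligned:
        return 1.0
    return sum(aligned.count(c) for c in gap_chars) / len(aligned)


def passes_gap_filter(aligned: str) -> bool:
    """True if the aligned sequence has at most MAX_GAP_FRACTION gaps."""
    return gap_fraction(aligned) <= MAX_GAP_FRACTION
```

A sequence would be kept for the Sativa step only if `passes_prefilter` holds for the raw sequence and `passes_gap_filter` holds for its aligned form.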
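To inspect the gzipped fasta reference files described above outside of DADA2, a minimal Python sketch such as the following could be used. The header layouts are assumptions based on the usual DADA2 conventions (semicolon-separated ranks for `assignTaxonomy` files, `ID Genus species` for `addSpecies` files) and should be checked against the actual files; all function names are hypothetical.

```python
import gzip
from typing import Dict, Iterator, Tuple

# Ranks listed in the description for the assignTaxonomy files.
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def read_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzipped fasta file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_taxonomy_header(header: str) -> Dict[str, str]:
    """Split an assumed 'Domain;Phylum;...;' header into named ranks."""
    names = [name for name in header.split(";") if name]
    return dict(zip(RANKS, names))


def parse_species_header(header: str) -> Tuple[str, str]:
    """Split an assumed 'ID Genus species' header into (id, species name)."""
    seq_id, _, species = header.partition(" ")
    return seq_id, species
```

Iterating over `read_fasta_gz(path)` and passing each header to the matching parser gives a quick check of how many ranks a given reference file actually carries.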
Provider:
Linnéuniversitetet
Created:
2025-02-18
Overview:
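This dataset is a 16S sequence database rigorously vetted with the Sativa program, derived from GTDB release R09-RS220 and intended for microbial taxonomy research. It provides files in several formats, supports taxonomic assignment with DADA2 as well as phylogenetic analysis, and is updated regularly to stay current with GTDB.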



