SBDI Sativa curated 16S GTDB database
Source: DataCite Commons. Updated 2025-11-03; indexed 2025-05-18.
Download link:
https://figshare.scilifelab.se/articles/dataset/SBDI_Sativa_curated_16S_GTDB_database/14869077/9
Resource description:
The data in this [repository](https://doi.org/10.17044/scilifelab.14869077) is the result of vetting 16S sequences from the Genome Taxonomy Database (GTDB) release R10-RS226 (r226) (https://gtdb.ecogenomic.org/; Parks et al. 2018) with the Sativa program (Kozlov et al. 2016) using the [sbdi-phylomarkercheck](https://github.com/biodiversitydata-se/sbdi-phylomarkercheck) Nextflow pipeline.

Sativa was used to check that the phylogenetic signal of each GTDB 16S sequence is consistent with its taxonomy. Before calling Sativa, sequences longer than 2000 nucleotides or containing Ns were removed, and the reverse complement of each sequence was calculated. Subsequently, sequences were aligned with HMMER (Eddy 2011) using the Barrnap (https://github.com/tseemann/barrnap) archaeal and bacterial 16S profiles respectively, and sequences containing more than 10% gaps were removed. The remaining sequences were analyzed with Sativa, and sequences that were not phylogenetically consistent with their taxonomy were removed.

Files for the DADA2 (Callahan et al. 2016) methods `assignTaxonomy` and `addSpecies` are available, in three different versions each. The `assignTaxonomy` files contain taxonomy for domain, phylum, class, order, family, genus and species. (Note that it has been proposed that species assignment for short 16S sequences requires 100% identity (Edgar 2018), so use species assignments from `assignTaxonomy` with caution.) The versions differ in the maximum number of genomes included per species: 1, 5 or 20, indicated by "1genome", "5genomes" and "20genomes" in the file names respectively. Using the version with 20 genomes per species should increase the chances that the `addSpecies` algorithm finds an exactly matching sequence, while a file with many genomes per species could bias the higher-level taxonomic annotations made by `assignTaxonomy`. Our recommendation is hence to use the "1genome" files for `assignTaxonomy` and the "20genomes" files for `addSpecies`.

The fasta files are gzipped and contain 16S sequences. The `assignTaxonomy` files carry taxonomy hierarchies from domain to species, whereas the `addSpecies` files carry sequence identifiers and species names. There is also a fasta file with the original GTDB sequence names: sbdi-gtdb-sativa.r09rs220.20genomes.fna.gz.

Taxonomic annotation of 16S amplicons using this data is available as an optional argument to the nf-core/ampliseq Nextflow workflow: `--dada_ref_taxonomy sbdi-gtdb` (https://nf-co.re/ampliseq; Straub et al. 2020).

In addition to the fasta files, the sbdi-phylomarkercheck workflow outputs phylogenetic trees by optimizing the branch lengths of the original phylogenomic GTDB trees based on a 16S sequence alignment. As not all species in GTDB will have correct 16S sequences, the GTDB trees are first subset to contain only species for which the species representative genome has a correct 16S sequence. Subsequently, branch lengths for the tree are optimized based on the original alignment of 16S sequences using IQ-TREE (Nguyen et al. 2015) with a GTR+F+I+G4 model. The alignment files end with .alnfna, the taxonomy files with .taxonomy.tsv and the tree files (newick-formatted) with .brlenopt.newick. They will be made available in nf-core/ampliseq for phylogenetic placement.

The data will be updated roughly once a year, after the GTDB database is updated.

Version history
v10 (2025-04-30): Update versions in this text.
v9 (2025-04-29): Update to GTDB R10-RS226.
v8 (2025-02-18): Remove extra sequences from e.g. "1genome" files that appeared due to ties.
v7 (2024-06-25): Update to GTDB R09-RS220 from R08-RS214.
v6 (2024-04-24): Replace manual procedure with Nextflow pipeline. Update to GTDB R08-RS214 from R07-RS207.
v5 (2022-10-07): Add missing fasta file with original GTDB names.
v4 (2022-08-31): Update to GTDB R07-RS207 from R06-RS202.

Acknowledgements
The computations were enabled by resources in projects NAISS 2023/22-601, SNIC 2022/22-500 and SNIC 2021/22-263 provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at UPPMAX, funded by the Swedish Research Council through grant agreement no. 2022-06725. Computations were also enabled by resources provided by Dr. Maria Vila-Costa, Institute of Environmental Assessment and Water Research (IDAEA-CSIC), Barcelona.
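The pre-filtering rules in the description (drop sequences longer than 2000 nucleotides or containing Ns, compute reverse complements, drop aligned sequences with more than 10% gaps) can be illustrated with a minimal Python sketch. This is not the sbdi-phylomarkercheck code. The function names are hypothetical, and the gap fraction is assumed here to be the share of `-` or `.` characters in the aligned sequence, which the description does not spell out.

```python
# Illustrative sketch only; not the sbdi-phylomarkercheck implementation.

# IUPAC-aware complement table (A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H).
_COMPLEMENT = str.maketrans("ACGTRYKMBVDHSWN", "TGCAYRMKVBHDSWN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def passes_prefilter(seq: str, max_len: int = 2000) -> bool:
    """Keep sequences of at most `max_len` nucleotides that contain no N."""
    return len(seq) <= max_len and "N" not in seq.upper()


def gap_fraction(aligned: str) -> float:
    """Fraction of gap characters in an aligned sequence (assumed definition)."""
    if not aligned:
        return 1.0
    gaps = sum(1 for ch in aligned if ch in "-.")
    return gaps / len(aligned)


def passes_gap_filter(aligned: str, max_gap_fraction: float = 0.10) -> bool:
    """Keep aligned sequences with at most `max_gap_fraction` gaps."""
    return gap_fraction(aligned) <= max_gap_fraction
```

In this sketch a sequence would first go through `passes_prefilter`, then be aligned by an external tool such as HMMER, and only then be checked with `passes_gap_filter`. The alignment step itself is outside the sketch.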
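For readers who want to inspect the reference files outside of DADA2, the following Python sketch reads headers from a gzipped fasta file and splits them into fields. It assumes the general DADA2 reference conventions: semicolon-separated ranks for `assignTaxonomy` files, and an identifier followed by the species name for `addSpecies` files. The function names are hypothetical, and the exact header layout of the SBDI files should be checked against the files themselves.

```python
# Illustrative sketch; header layouts are assumed to follow the general
# DADA2 reference conventions and should be verified against the real files.
import gzip

RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def iter_fasta_headers(path):
    """Yield header lines (without the leading '>') from a gzipped fasta file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                yield line[1:].strip()


def parse_assign_taxonomy_header(header: str) -> dict:
    """Map a 'Rank1;Rank2;...;' header onto the rank names in RANKS."""
    fields = header.lstrip(">").strip().rstrip(";").split(";")
    return dict(zip(RANKS, fields))


def parse_add_species_header(header: str) -> tuple:
    """Split a 'SeqID Genus species' header into (seq_id, species_name)."""
    seq_id, _, species_name = header.lstrip(">").strip().partition(" ")
    return seq_id, species_name
```

The parsers accept headers with or without the leading `>`, so they can be applied directly to the output of `iter_fasta_headers`.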
Provider:
Linnéuniversitetet
Created:
2025-04-29
Collected summary
Dataset introduction

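Background and challenges
Background overview
This dataset is a collection of 16S sequences based on GTDB release R10-RS226, screened for phylogenetic consistency with the Sativa program, and is intended to improve the accuracy of microbial taxonomic analyses. It provides `assignTaxonomy` and `addSpecies` files for DADA2 analysis in versions with different numbers of genomes per species, can be used with the nf-core/ampliseq workflow, and is suitable for 16S amplicon studies and phylogenetic analysis.
The content above was collected and summarized by 遇见数据集.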
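The idea behind the 1, 5 and 20 genome versions, a cap on how many genomes contribute sequences per species, can be sketched in Python as below. How the actual pipeline ranks genomes within a species is not described here, so this sketch simply assumes that genomes are kept in input order. The function name and record layout are hypothetical.

```python
# Illustrative sketch; the real selection criteria for genomes are not
# described in the source, so input order is used as a stand-in.
from collections import defaultdict


def cap_genomes_per_species(records, max_genomes):
    """Keep sequences from at most `max_genomes` distinct genomes per species.

    `records` is an iterable of (species, genome_id, sequence) tuples.
    All sequences of a kept genome are retained.
    """
    kept_genomes = defaultdict(list)  # species -> genome ids kept so far
    kept = []
    for species, genome_id, sequence in records:
        genomes = kept_genomes[species]
        if genome_id not in genomes:
            if len(genomes) >= max_genomes:
                continue  # species already has its quota of genomes
            genomes.append(genome_id)
        kept.append((species, genome_id, sequence))
    return kept
```

With `max_genomes=1` this mimics the spirit of a "1genome" file, and with `max_genomes=20` that of a "20genomes" file, under the ordering assumption stated above.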



