SH information from UNITE databases
Source: DataCite Commons. Updated 2025-01-15; indexed 2024-07-13.
Download link:
https://figshare.scilifelab.se/articles/dataset/SH_information_from_UNITE_databases/19411403/1
Description:
The data are the result of querying the PlutoF API (https://plutof.docs.apiary.io; Abarenkov et al. 2010) with all sequence names in the UNITE general FASTA release (https://doi.org/10.15156/BIO/786368, Abarenkov et al. 2020; https://doi.org/10.15156/BIO/1280049, Abarenkov et al. 2021) to find the species hypothesis (SH) at level 1.5, version 8, for each sequence. The result is a sequence-to-SH matching file (`*.seq2sh.tsv`). For each SH, the complete taxonomy is then extracted from PlutoF through the same API and stored in the `*.SHs.tax` files.

Files are available for UNITE version 8.2: `sh_general_release_dynamic_04.02.2020.seq2sh.tsv.bz2`, containing sequence-to-SH matches, and `sh_general_release_dynamic_04.02.2020.SHs.tax.bz2`, containing SH taxonomies. For UNITE version 8.3, the corresponding files are `sh_general_release_dynamic_10.05.2021.seq2sh.tsv.bz2` and `sh_general_release_dynamic_10.05.2021.SHs.tax.bz2`. All files are tab-separated text files compressed with bzip2.

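Because the files are bzip2-compressed, tab-separated text, they can be read with the Python standard library alone. The sketch below is a minimal example under one loud assumption: that the first column holds the sequence name and the second the SH identifier. The column layout is not stated in this description, so check it against the actual file or the repository README before relying on it. The function name `load_seq2sh` is made up for illustration.

```python
import bz2
import csv


def load_seq2sh(path):
    """Read a bzip2-compressed, tab-separated file into a dict.

    ASSUMPTION: column 1 is the sequence name and column 2 the SH
    identifier. Verify this against the real file before use.
    """
    mapping = {}
    # bz2.open in text mode decompresses on the fly.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            mapping[row[0]] = row[1]
    return mapping
```

A call such as `load_seq2sh("sh_general_release_dynamic_10.05.2021.seq2sh.tsv.bz2")` would then return the mapping for the version 8.3 file, provided the column assumption holds.
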
Corresponding files are also available for the all-eukaryotes version of the UNITE database (https://doi.org/10.15156/BIO/786370, Abarenkov et al. 2020b; https://doi.org/10.15156/BIO/1280127, Abarenkov et al. 2021b).
Assignment of species hypotheses to ITS amplicons using these data and the UNITE general FASTA release is available as an optional argument in the nf-core/ampliseq Nextflow workflow from version 2.3.2 onward: `--addsh` together with `--dada_ref_taxonomy unite-fungi` (https://nf-co.re/ampliseq; Straub et al. 2020).

**Generation of files**
After the UNITE general FASTA release was downloaded and extracted, each sequence name in the FASTA file was used as a query to PlutoF to find the level 1.5 SH in release 8 that the sequence belongs to. These results make up the `*.seq2sh.tsv` files with sequence-to-SH matches. Each SH was subsequently used as a query to PlutoF to extract its complete taxonomy, which is stored in the `*.SHs.tax` files.
Two Python scripts for automatic querying and generation of the files can be found in the `scripts` folder of the GitHub repository https://github.com/biodiversitydata-se/unite-shinfo. See the accompanying README file for usage information.
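The two-step procedure described above can be sketched roughly as follows. This is not the authors' script (see the GitHub repository for that); it only illustrates the control flow. The PlutoF requests are deliberately left out: `lookup_sh` and `lookup_taxonomy` are hypothetical callables supplied by the caller, not functions of any real client library. Taking the whole FASTA header after `>` as the sequence name, and writing two-column output, are likewise assumptions.

```python
import csv


def fasta_names(lines):
    """Yield the header text (without '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def build_tables(fasta_lines, lookup_sh, lookup_taxonomy, seq2sh_out, tax_out):
    """Write sequence-to-SH and SH-to-taxonomy tables as tab-separated text.

    lookup_sh and lookup_taxonomy are HYPOTHETICAL stand-ins for the
    PlutoF API queries; the two-column layout is an assumption.
    """
    seq_writer = csv.writer(seq2sh_out, delimiter="\t", lineterminator="\n")
    tax_writer = csv.writer(tax_out, delimiter="\t", lineterminator="\n")
    seen = set()
    for name in fasta_names(fasta_lines):
        sh = lookup_sh(name)  # step 1: sequence name -> SH
        seq_writer.writerow([name, sh])
        if sh not in seen:  # step 2: query each SH only once
            seen.add(sh)
            tax_writer.writerow([sh, lookup_taxonomy(sh)])
```

Tracking the SHs already seen keeps the second step to one taxonomy query per SH, even when many sequences share the same SH.
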
Provided by:
Swedish Biodiversity Data Infrastructure (SBDI)
Created:
2022-05-13



