COI reference sequences from BOLD DB

Name: COI reference sequences from BOLD DB
Creator: SciLifeLab
Published: 2024-04-17 00:00:00
License: No description available

figshare.scilifelab.se (updated 2024-04-17, indexed 2025-01-21)

Download link:

https://figshare.scilifelab.se/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192/4

Resource description:

Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The fasta file bold_clustered_cleaned.fasta.gz has record ids that can be queried in the Public Data Portal, and each fasta header contains the taxonomic ranks plus the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file bold_info_filtered.tsv.gz.

The file bold_clustered.sintax.fasta.gz is directly compatible with the SINTAX algorithm in vsearch, while the files bold_clustered.assignTaxonomy.fasta.gz and bold_clustered.addSpecies.fasta.gz are directly compatible with the assignTaxonomy and addSpecies functions from DADA2, respectively. The dataset was last created on December 16, 2022.

NOTE: We have noticed that the gzipped files in this upload have been compressed twice for some reason. A quick fix is to unzip any file with a ".gz" extension, rename the unzipped file by adding the ".gz" extension back, and then unzip it once more. Sorry for the inconvenience.

Methods

The code used to generate this dataset consists of a snakemake workflow wrapped into a python package that can be installed with conda (`conda install -c bioconda coidb`). First, sequence and taxonomic information for records in the BOLD database is downloaded from the GBIF Hosted Datasets. This data is then filtered to keep only records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information is parsed in order to assign species names and resolve higher-level ranks for each BIN ID. Sequences are processed to remove gap characters and leading and trailing `N`s. After this, any sequences with remaining non-standard characters are removed. Sequences are then clustered at 100% identity using vsearch (Rognes _et al._ 2016). This clustering is done separately for the sequences assigned to each BIN ID.

For more information, see https://github.com/biodiversitydata-se/coidb
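
Two points in the description lend themselves to a short illustration: the double-compression workaround and the sequence clean-up step. The Python sketch below is not taken from coidb; the function names `gunzip_fully` and `clean_sequence` are hypothetical. It also assumes that "-" is the only gap character and that "standard" means the letters A, C, G and T, which the description does not spell out.

```python
import gzip
import re
from typing import Optional

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def gunzip_fully(data: bytes, max_layers: int = 5) -> bytes:
    """Decompress repeatedly while the payload still looks gzip-compressed.

    Handles files that were gzipped once, twice, or not at all.
    max_layers is an arbitrary safety limit (assumption).
    """
    for _ in range(max_layers):
        if not data.startswith(GZIP_MAGIC):
            break
        data = gzip.decompress(data)
    return data


def clean_sequence(seq: str, alphabet: str = "ACGT") -> Optional[str]:
    """Apply the clean-up described in Methods to a single sequence.

    Removes "-" gap characters (assumed gap symbol), trims leading and
    trailing N, and returns None if anything outside `alphabet` remains
    (assumed definition of "non-standard") or if nothing is left.
    """
    seq = seq.upper().replace("-", "").strip("N")
    if not seq or re.search(f"[^{re.escape(alphabet)}]", seq):
        return None
    return seq
```

For example, `gunzip_fully(Path("bold_clustered.sintax.fasta.gz").read_bytes())` (with `Path` imported from `pathlib`) would return the plain FASTA text as bytes regardless of how many gzip layers wrap it. This reads the whole file into memory, so for large files the manual rename-and-unzip approach from the note may be more practical.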

Provider:

SciLifeLab
