COI reference sequences from BOLD DB

Name: COI reference sequences from BOLD DB
Creator: Swedish Museum of Natural History
Published: 2026-02-06 13:31:20
License: Not specified

Source: DataCite Commons (updated 2026-02-06; indexed 2024-07-13)

Download link:

https://figshare.scilifelab.se/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192/4

Description:

Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The FASTA file bold_clustered_cleaned.fasta.gz has record IDs that can be queried in the Public Data Portal, and each FASTA header contains the taxonomic ranks plus the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file bold_info_filtered.tsv.gz. The file bold_clustered.sintax.fasta.gz is directly compatible with the SINTAX algorithm in vsearch, while the files bold_clustered.assignTaxonomy.fasta.gz and bold_clustered.addSpecies.fasta.gz are directly compatible with the assignTaxonomy and addSpecies functions from DADA2, respectively. The dataset was last generated on December 16, 2022.

NOTE: We have noticed that the gzipped files in this upload have been compressed twice for some reason. A quick fix is to unzip any file with a ".gz" extension, rename the unzipped file by adding the ".gz" extension back, and then unzip it once more. Sorry for the inconvenience.

Methods

The code used to generate this dataset consists of a snakemake workflow wrapped into a Python package that can be installed with conda (`conda install -c bioconda coidb`). First, sequence and taxonomic information for records in the BOLD database is downloaded from the GBIF Hosted Datasets. This data is then filtered to keep only records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information is parsed in order to assign species names and resolve higher-level ranks for each BIN ID. Sequences are processed to remove gap characters and leading and trailing `N`s. After this, any sequences with remaining non-standard characters are removed. Sequences are then clustered at 100% identity using vsearch (Rognes _et al._ 2016). This clustering is done separately for the sequences assigned to each BIN ID. For more information, see https://github.com/biodiversitydata-se/coidb
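The double compression mentioned in the note above can also be handled without renaming files by hand. The following is a minimal Python sketch, not part of the coidb package: it peels off gzip layers for as long as the stream starts with the gzip magic bytes, then iterates over the FASTA records. The function names `open_gzip_layers` and `read_fasta` and the `max_layers` parameter are hypothetical, introduced here for illustration. Headers are returned as-is, since the exact header layout is not specified in this description.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_gzip_layers(path, max_layers=3):
    """Open `path` as text, removing up to `max_layers` nested gzip layers.

    Hypothetical helper: works for plain, singly and doubly gzipped files.
    """
    stream = open(path, "rb")  # buffered reader, so peek() is available
    for _ in range(max_layers):
        # Stop as soon as the data no longer looks like a gzip stream.
        if stream.peek(2)[:2] != GZIP_MAGIC:
            break
        # mode="rb" is passed explicitly so that nesting GzipFile objects
        # does not depend on the `mode` attribute of the wrapped stream.
        stream = io.BufferedReader(gzip.GzipFile(fileobj=stream, mode="rb"))
    return io.TextIOWrapper(stream, encoding="utf-8")


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Example usage (assumes the file has been downloaded to the working directory):
# with open_gzip_layers("bold_clustered_cleaned.fasta.gz") as handle:
#     for header, seq in read_fasta(handle):
#         print(header, len(seq))
```

Because the layers are detected from the file contents rather than the file name, the same call should keep working if a later version of the upload is compressed only once.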

Provider:

Swedish Museum of Natural History

Created:

2023-01-16
