COI reference sequences from BOLD DB

Name: COI reference sequences from BOLD DB
Creator: Swedish Biodiversity Data Infrastructure (SBDI)
Published: 2025-01-15 14:48:01
License: No description available

Source: DataCite Commons. Updated: 2025-01-15. Indexed: 2025-04-16.

Download link:

https://figshare.scilifelab.se/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192/1

Description:

Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The fasta file bold_clustered_cleaned.fasta.gz has record ids that can be queried in the Public Data Portal, and each fasta header contains the taxonomic ranks plus the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file bold_info_filtered.tsv.gz. The dataset was last created on February 18, 2022.

Methods

The code used to generate this dataset consists of a snakemake workflow wrapped into a python package that can be installed with conda (`conda install -c bioconda coidb`). First, sequence and taxonomic information for records in the BOLD database is downloaded from the GBIF Hosted Datasets. This data is then filtered to keep only records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information is parsed to assign species names and resolve higher-level ranks for each BIN ID. Sequences are processed to remove gap characters and leading and trailing `N`s. After this, any sequences with remaining non-standard characters are removed. Sequences are then clustered at 100% identity using vsearch (Rognes _et al._ 2016). This clustering is done separately for the sequences assigned to each BIN ID.

For more information, see https://github.com/biodiversitydata-se/coidb
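The sequence-cleaning steps described in the Methods (remove gap characters, trim leading and trailing `N`s, then discard sequences that still contain non-standard characters) can be sketched in Python as below. This is only an illustration and not the coidb implementation: the function name `clean_sequence` is hypothetical, and it assumes that gaps are written as `-` or `.` and that "standard" means the unambiguous bases A, C, G and T.

```python
import re

# Assumption: gap characters are '-' or '.'; the source does not list them.
GAP_CHARS = "-."

# Assumption: "standard" characters are the unambiguous bases A, C, G, T.
STANDARD_ONLY = re.compile(r"[ACGT]+")


def clean_sequence(seq):
    """Return the cleaned sequence, or None if it should be discarded.

    Hypothetical helper illustrating the described steps; it is not
    taken from the coidb package.
    """
    seq = seq.upper()
    # Step 1: remove gap characters anywhere in the sequence.
    for gap in GAP_CHARS:
        seq = seq.replace(gap, "")
    # Step 2: trim leading and trailing N characters.
    seq = seq.strip("N")
    # Step 3: discard empty sequences and those with any remaining
    # non-standard character (for example an internal N).
    if not STANDARD_ONLY.fullmatch(seq):
        return None
    return seq
```

Under these assumptions, a sequence such as `NN-AC-GTNN` would be reduced to `ACGT`, while a sequence with an internal ambiguous base, such as `ACNGT`, would be dropped.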

Provider:

Swedish Biodiversity Data Infrastructure (SBDI)

Created:

2022-09-15
