COI reference sequences from BOLD DB
DataCite Commons | Updated: 2025-01-15 | Indexed: 2025-04-16
Download link:
https://figshare.scilifelab.se/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192/1
Resource description:
Dataset description
This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences
collected from the BOLD database. The fasta file
bold_clustered_cleaned.fasta.gz has record ids that can be queried in the Public
Data Portal, and each fasta header contains the taxonomic ranks plus the BIN ID
assigned to the record. The taxonomic information for each record is also given
in the tab-separated file bold_info_filtered.tsv.gz.

The dataset was last generated on February 18, 2022.
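The two files described above can be read with the Python standard library alone. The following is a minimal sketch, not part of the dataset's own tooling. It assumes that the record id is the first whitespace-delimited token of each fasta header and that the TSV file starts with a header row; neither detail is stated in the description, and the function names `read_fasta` and `read_tsv` are hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (record_id, description, sequence) from a gzipped fasta file."""
    record_id = None
    description = ""
    chunks = []
    with gzip.open(path, "rt") as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                # Emit the previous record before starting a new one.
                if record_id is not None:
                    yield record_id, description, "".join(chunks)
                # Assumption: the id is the first whitespace-delimited token;
                # the rest of the header is returned unparsed.
                parts = line[1:].split(None, 1)
                record_id = parts[0] if parts else ""
                description = parts[1] if len(parts) > 1 else ""
                chunks = []
            elif line:
                chunks.append(line)
    if record_id is not None:
        yield record_id, description, "".join(chunks)


def read_tsv(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per row of a gzipped tab-separated file.

    Assumption: the first row holds the column names.
    """
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")
```

Calling `read_fasta("bold_clustered_cleaned.fasta.gz")` would then iterate over the sequences one record at a time without loading the whole file into memory. The header text is deliberately left unparsed, because the exact layout of the taxonomic ranks and BIN ID in the header is not specified here.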
Methods
The code used to generate this dataset consists of a snakemake workflow wrapped
into a python package that can be installed with conda
(`conda install -c bioconda coidb`).
First, sequence and taxonomic information for records in the BOLD database is
downloaded from the GBIF Hosted Datasets.
These data are then filtered to keep only records annotated as 'COI-5P' and assigned
to a BIN ID. The taxonomic information is parsed in order to assign species names
and resolve higher-level ranks for each BIN ID. Sequences are processed to remove
gap characters and leading and trailing `N`s. After this, any sequences with
remaining non-standard characters are removed.
Sequences are then clustered at 100% identity using vsearch
(Rognes _et al._ 2016). This clustering is done separately for sequences assigned
to each BIN ID.
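The sequence clean-up step described above can be illustrated with a short sketch. This is not the coidb implementation. It assumes that `-` is the only gap character, that "standard" characters are A, C, G and T, and that sequences are compared case-insensitively; the function name `clean_sequence` is hypothetical.

```python
from typing import Optional

GAP_CHAR = "-"            # assumption: only '-' is treated as a gap
STANDARD = set("ACGT")    # assumption: only unambiguous DNA bases are kept


def clean_sequence(seq: str) -> Optional[str]:
    """Return the cleaned sequence, or None if it should be discarded."""
    # Normalise case so that 'acgt' and 'ACGT' are treated alike (assumption).
    seq = seq.upper()
    # Remove gap characters anywhere in the sequence.
    seq = seq.replace(GAP_CHAR, "")
    # Trim leading and trailing Ns only; internal Ns are left in place.
    seq = seq.strip("N")
    # Discard empty sequences and those with remaining non-standard characters.
    if not seq or set(seq) - STANDARD:
        return None
    return seq
```

Under these assumptions, a sequence such as `NN-AC-GTNN` is reduced to `ACGT`, while `ACNGT` is discarded because the internal `N` survives trimming. The clustering step itself is performed by vsearch and is not reproduced here.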
For more information, see https://github.com/biodiversitydata-se/coidb
Provider:
Swedish Biodiversity Data Infrastructure (SBDI)
Created:
2022-09-15



