
COI reference sequences from BOLD DB

DataCite Commons, updated 2025-01-15, indexed 2025-04-16
Download link:
https://figshare.scilifelab.se/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192/1
Description:
Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The FASTA file bold_clustered_cleaned.fasta.gz has record IDs that can be queried in the Public Data Portal, and each FASTA header contains the taxonomic ranks plus the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file bold_info_filtered.tsv.gz.

The dataset was last generated on February 18, 2022.

Methods

The code used to generate this dataset is a snakemake workflow wrapped in a Python package that can be installed with conda (`conda install -c bioconda coidb`). The workflow runs the following steps:

1. Sequence and taxonomic information for records in the BOLD database is downloaded from the GBIF Hosted Datasets.
2. The data is filtered to keep only records that are annotated as 'COI-5P' and assigned to a BIN ID.
3. The taxonomic information is parsed to assign species names and resolve higher-level ranks for each BIN ID.
4. Sequences are processed to remove gap characters and leading and trailing `N`s. Any sequences that still contain non-standard characters after this step are removed.
5. Sequences are clustered at 100% identity using vsearch (Rognes _et al._ 2016). Clustering is done separately for the sequences assigned to each BIN ID.

For more information, see https://github.com/biodiversitydata-se/coidb
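The sequence-cleaning step described in the Methods can be illustrated with a minimal Python sketch. This is not the coidb implementation. The function name `clean_sequence` is hypothetical, and the sketch assumes that gap characters are `-`, that sequences are compared in upper case, and that "standard" characters means the four unambiguous bases A, C, G and T.

```python
import re
from typing import Optional

# Assumption: only unambiguous nucleotides count as "standard" characters.
STANDARD_BASES = re.compile(r"[ACGT]+")


def clean_sequence(seq: str) -> Optional[str]:
    """Return a cleaned sequence, or None if it should be discarded.

    Hypothetical helper mirroring the described steps:
    1. remove gap characters (assumed to be '-'),
    2. strip leading and trailing N's,
    3. discard sequences that still contain non-standard characters.
    """
    seq = seq.upper().replace("-", "")  # step 1: drop gaps
    seq = seq.strip("N")                # step 2: trim flanking N's
    if not seq or not STANDARD_BASES.fullmatch(seq):
        return None                     # step 3: reject leftovers
    return seq


if __name__ == "__main__":
    # Made-up example sequences, not records from the dataset.
    for raw in ["NN-ACGT--ACNN", "ACGNTAC", "ACGRT"]:
        print(raw, "->", clean_sequence(raw))
```

Note that an `N` inside a sequence is not trimmed, so under these assumptions such a sequence is discarded rather than repaired.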
Provider:
Swedish Biodiversity Data Infrastructure (SBDI)
Created:
2022-09-15