COI reference sequences from BOLD DB
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://figshare.com/articles/dataset/COI_reference_sequences_from_BOLD_DB/20514192
Resource summary:
Dataset description
This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The dataset is based on the BOLD Data Package from 30 January 2026 and was created on February 1 2026.
The fasta file coidb.clustered.fasta.gz represents a non-redundant set of filtered sequences (clustered at 100% identity, see Methods) with record ids that can be queried in the Public Data Portal. Each fasta header also contains the BIN ID assigned to the record (with the exception of prokaryotic records which instead have process ids as BIN IDs).
The taxonomic information for all filtered records is given in the tab-separated file coidb.info.tsv.gz.
Files compatible with specific tools for taxonomic assignments are found under the dada2/, sintax/, and qiime2/ folders.
Methods
This dataset was generated with the coidb package (v0.6.0).
Briefly, records from the BOLD Data Package are filtered to:
- keep only records assigned a proper BOLD BIN (e.g. `BOLD:AAA0008`), as well as records assigned to Bacteria or Archaea
- keep only records with marker_code 'COI-5P'
- remove records shorter than 500 bp
- remove records containing non-standard DNA characters

Remaining sequences are then clustered at 100% identity separately for each BOLD BIN using vsearch (Rognes et al. 2016). Records without BOLD BINs that are assigned to Bacteria/Archaea are not clustered.
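The four filtering rules above can be sketched as a single predicate. This is only an illustration and not the coidb implementation: the field names (`bin_uri`, `kingdom`, `marker_code`, `nuc`), the BIN pattern, and the choice of A/C/G/T as the "standard" alphabet are all assumptions made for the example.

```python
import re

# Assumed shape of a "proper" BIN id, inferred only from the example BOLD:AAA0008.
BIN_PATTERN = re.compile(r"^BOLD:[A-Z0-9]+$")
# Assumption: "standard DNA characters" means the four unambiguous bases.
STANDARD_DNA = set("ACGT")
PROKARYOTES = {"Bacteria", "Archaea"}


def keep_record(rec, min_len=500):
    """Return True if a record (a dict with hypothetical keys) passes all filters."""
    seq = (rec.get("nuc") or "").upper()
    has_bin = bool(BIN_PATTERN.match(rec.get("bin_uri") or ""))
    is_prokaryote = rec.get("kingdom") in PROKARYOTES
    return (
        (has_bin or is_prokaryote)               # proper BIN, or Bacteria/Archaea
        and rec.get("marker_code") == "COI-5P"   # marker filter
        and len(seq) >= min_len                  # drop records shorter than 500 bp
        and set(seq) <= STANDARD_DNA             # drop non-standard characters
    )


example = {
    "bin_uri": "BOLD:AAA0008",
    "kingdom": "Animalia",
    "marker_code": "COI-5P",
    "nuc": "ACGT" * 150,  # 600 bp
}
print(keep_record(example))
```

A record without a BIN would still pass if its `kingdom` value is Bacteria or Archaea, which mirrors the exception described in the first rule.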
The taxonomic information for records is processed to handle missing data and non-unique parent lineages. A consensus taxonomy for each BOLD BIN is calculated by taking into account the taxonomic information given for records assigned to each BIN. This is done in two ways:
- the `inclNA` method calculates a consensus based on all taxonomic labels, even the ones with missing data
- the `exclNA` method excludes taxonomic labels with missing data when calculating the consensus

Because these methods have their pros and cons (in short, `exclNA` resolves more species but `inclNA` is more conservative), both versions of the downstream files are available in this item, and it is up to the user to decide which one to use.
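To make the difference between the two methods concrete, here is a minimal sketch of one way such a consensus could be computed. It is not the coidb algorithm: the 80% agreement threshold, the rank names, and the rule that ranks below an unresolved rank stay unresolved are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical rank list; the actual columns in coidb.info.tsv.gz may differ.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def consensus(records, method="exclNA", threshold=0.8, ranks=RANKS):
    """Consensus lineage for the records of one BIN (illustrative only).

    inclNA: records with a missing label count in the denominator.
    exclNA: records with a missing label are ignored at that rank.
    """
    result = {}
    resolved = True
    for rank in ranks:
        labels = [r.get(rank) for r in records]
        named = [label for label in labels if label]
        denominator = len(labels) if method == "inclNA" else len(named)
        result[rank] = None
        if resolved and named:
            label, count = Counter(named).most_common(1)[0]
            if count / denominator >= threshold:
                result[rank] = label
                continue
        # Assumption: once a rank is unresolved, lower ranks stay unresolved.
        resolved = False
    return result


# Five records in one BIN; two lack a species label.
recs = [{"genus": "Aus", "species": "Aus bus"}] * 3 + [{"genus": "Aus", "species": None}] * 2
print(consensus(recs, "exclNA", ranks=("genus", "species")))
print(consensus(recs, "inclNA", ranks=("genus", "species")))
```

Under these assumptions `exclNA` assigns the species (3 of 3 labelled records agree), while `inclNA` stops at genus (only 3 of 5 records carry the label), which is the trade-off described above.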
Description of files
coidb.clustered.fasta.gz
This file contains nucleotide sequences of all filtered records, clustered at 100% identity within each BOLD BIN. The fasta headers have the format:
>{processid} bin_uri:{BOLD BIN}
where '{processid}' corresponds to the record identifier chosen as the cluster centroid and '{BOLD BIN}' shows which BOLD BIN the record belongs to.
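A header in this format can be split into its two parts with a few lines of Python. The sketch below is a hedged example: the function names are hypothetical, and the process id in the demonstration is made up.

```python
import gzip


def parse_header(line):
    """Split '>{processid} bin_uri:{BOLD BIN}' into (processid, bin_id)."""
    if not line.startswith(">"):
        raise ValueError(f"not a fasta header: {line!r}")
    processid, _, rest = line[1:].strip().partition(" ")
    prefix = "bin_uri:"
    if not rest.startswith(prefix):
        raise ValueError(f"missing bin_uri field: {line!r}")
    return processid, rest[len(prefix):]


def iter_headers(path):
    """Yield (processid, bin_id) for every record in a gzipped fasta file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                yield parse_header(line)


# "ABCDE001-20" is an invented process id used only for this demonstration.
print(parse_header(">ABCDE001-20 bin_uri:BOLD:AAA0008"))
```

Calling `iter_headers("coidb.clustered.fasta.gz")` on the downloaded file would then yield one pair per cluster centroid. For prokaryotic records the second element is expected to be a process id rather than a BIN, as noted in the dataset description.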
coidb.info.tsv.gz
This file contains taxonomic information (including BOLD BIN where applicable) as well as nucleotide sequences for all filtered records.
coidb.stats.exclNA.txt / coidb.stats.inclNA.txt
These files contain summary statistics such as the total number of records, unique BINs, and clustered sequences. The first seven lines are identical because they refer to general statistics of the database, while the rest is specific to the method used to calculate the consensus taxonomy (see Methods).
timestamps.txt
This file shows the name of the BOLD Data Package and the TSV file extracted and used as input to coidb.
logs/fix_nonunique.coidb.log
This logfile shows how taxa with non-unique parent lineages were modified during database creation.
shasum.txt
This file contains checksums and can be used to verify file integrity by running
shasum -c shasum.txt
Tool-specific files
DADA2
The dada2/ folder contains fasta files that are compatible with the DADA2 assignTaxonomy and addSpecies functions. See more information at https://benjjneb.github.io/dada2/assign.html.
The files with 'toGenus' and 'toSpecies' in their names have taxonomic information down to the genus and species level, respectively. The files with 'addSpecies' in their names contain only the species name and should be used with the 'addSpecies' function.
SINTAX
The sintax/ folder contains fasta files that are compatible with taxonomic assignments using the SINTAX algorithm as implemented in `vsearch`. See more information in the vsearch manual.
QIIME2
The qiime2/ folder contains info files that can be imported with QIIME2. For more information, see the README file at https://github.com/insect-biome-atlas/coidb.
Created:
2022-09-15



