Data from: Rapid and accurate taxonomic classification of insect (Class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a naïve Bayesian classifier
Source: DataONE | Updated: 2014-02-18 | Indexed: 2024-06-27
Download link:
https://search.dataone.org/view/null
Resource description:
Current methods to identify unknown insect (class Insecta) cytochrome c oxidase (COI barcode) sequences often rely on difficult-to-define distance thresholds, sequence similarity cutoffs, or monophyly. Most methods do not provide a measure of confidence for the taxonomic assignments they provide. The aim of this study is to use a naïve Bayesian classifier (Wang et al., 2007) to automate unsupervised taxonomic assignments for large batches of insect COI sequences, such as data obtained from environmental barcoding using next-generation sequencing platforms. This method provides rank-flexible taxonomic assignments with an associated bootstrap support value, and it is faster than the BLAST-based methods commonly used in environmental sequence surveys. We have developed and rigorously tested the performance of three different training sets using leave-one-out cross-validation, two field datasets, and targeted testing of Lepidoptera, Diptera, and Mantodea sequences obtained from the Barcode of Life Data system. We found that type I error rates (incorrect taxonomic assignments with high bootstrap support) were already relatively low but could be lowered further by ensuring that all query taxa are actually present in the reference database. Choosing bootstrap support cutoffs according to query length and summarizing taxonomic assignments to more inclusive ranks can also help to reduce error while retaining the maximum number of assignments. Additionally, we highlight gaps in the taxonomic and geographic representation of insects in public sequence databases that will require further work by taxonomists to improve the quality of assignments generated using any method.
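To make the idea of a naïve Bayesian assignment with bootstrap support more concrete, the following is a minimal sketch in the spirit of the word-based classifier of Wang et al. (2007). It is not the authors' implementation and does not use the dataset's training sets: the function names (`kmers`, `train`, `classify`), the flat one-rank taxonomy, and the tiny made-up sequences in the usage note are all assumptions for illustration. The sketch assumes 8-base words, 100 bootstrap trials, and resampling of one eighth of the query's words per trial.

```python
import math
import random
from collections import defaultdict

K = 8  # word (k-mer) size; assumed here, following the 8-base words of Wang et al. (2007)


def kmers(seq, k=K):
    """Return the set of overlapping k-mers found in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references):
    """Build word counts from a list of (taxon, sequence) pairs."""
    word_seq_count = defaultdict(int)  # reference sequences containing each word
    taxon_word_count = defaultdict(lambda: defaultdict(int))  # same, per taxon
    taxon_size = defaultdict(int)  # reference sequences per taxon
    for taxon, seq in references:
        taxon_size[taxon] += 1
        for word in kmers(seq):
            word_seq_count[word] += 1
            taxon_word_count[taxon][word] += 1
    return len(references), word_seq_count, taxon_word_count, taxon_size


def log_score(words, taxon, model):
    """Log of the product of smoothed per-word probabilities for one taxon."""
    n_total, word_seq_count, taxon_word_count, taxon_size = model
    score = 0.0
    for word in words:
        # Word prior estimated from the whole training set.
        prior = (word_seq_count.get(word, 0) + 0.5) / (n_total + 1)
        # Conditional probability of the word given the taxon, smoothed by the prior.
        cond = (taxon_word_count[taxon].get(word, 0) + prior) / (taxon_size[taxon] + 1)
        score += math.log(cond)
    return score


def best_taxon(words, model):
    """Return the taxon with the highest score for the given words."""
    taxa = sorted(model[3])
    return max(taxa, key=lambda taxon: log_score(words, taxon, model))


def classify(seq, model, n_boot=100, seed=0):
    """Assign a taxon and report the fraction of bootstrap trials that agree."""
    rng = random.Random(seed)
    words = sorted(kmers(seq))  # query must be at least K bases long
    assignment = best_taxon(words, model)
    subset_size = max(1, len(words) // K)
    hits = 0
    for _ in range(n_boot):
        # Resample a subset of the query's words with replacement and reclassify.
        sample = rng.choices(words, k=subset_size)
        if best_taxon(sample, model) == assignment:
            hits += 1
    return assignment, hits / n_boot
```

A usage example with two made-up reference sequences (not real barcodes):

```python
model = train([
    ("Lepidoptera", "ATGGCAACATTATACTTTATTTTTGGAATTTGAGCAGG"),
    ("Diptera", "CCGGGTTCCCTAATCGGCGACGACCAAATCTACAACGT"),
])
taxon, support = classify("ATGGCAACATTATACTTTATTTTTGGAATTTGAGCAGG", model)
print(taxon, support)
```

In practice an assignment would be kept only if its support meets a chosen cutoff, which the description above suggests selecting according to query length; no specific cutoff values are assumed here.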
Created:
2014-02-18



