Data from: Rapid and accurate taxonomic classification of insect (Class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a naïve Bayesian classifier
Source: DataONE. Updated: 2014-02-18. Indexed: 2024-06-27.
Download link:
https://search.dataone.org/view/null
Description:
Current methods to identify unknown insect (class Insecta) cytochrome c oxidase subunit 1 (COI barcode) sequences often rely on hard-to-define distance thresholds, sequence similarity cutoffs, or monophyly. Most methods do not provide a measure of confidence for the taxonomic assignments they produce. The aim of this study is to use a naïve Bayesian classifier (Wang et al., 2007) to automate unsupervised taxonomic assignments for large batches of insect COI sequences, such as data obtained from environmental barcoding using next-generation sequencing platforms. This method provides rank-flexible taxonomic assignments with an associated bootstrap support value, and it is faster than the BLAST-based methods commonly used in environmental sequence surveys. We have developed three different training sets and rigorously tested their performance using leave-one-out cross-validation, two field datasets, and targeted testing of Lepidoptera, Diptera, and Mantodea sequences obtained from the Barcode of Life Data system. We found that type I error rates (incorrect taxonomic assignments with high bootstrap support) were already relatively low but could be lowered further by ensuring that all query taxa are actually present in the reference database. Choosing bootstrap support cutoffs according to query length and summarizing taxonomic assignments to more inclusive ranks can also help to reduce error while retaining the maximum number of assignments. Additionally, we highlight gaps in the taxonomic and geographic representation of insects in public sequence databases that will require further work by taxonomists to improve the quality of assignments generated using any method.
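To make the idea of a bootstrap-supported naïve Bayesian assignment concrete, the following is a minimal Python sketch, not the code used in the study. It assumes a word-based (k-mer) classifier in the style of Wang et al. (2007): the smoothing formulas and the choice to resample 1/k of the query words per bootstrap trial reflect one reading of that method, and the class name `KmerNaiveBayes`, its methods, and the toy sequences are hypothetical illustrations.

```python
import math
import random
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of distinct overlapping k-mers (words) in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class KmerNaiveBayes:
    """Toy word-based naive Bayes classifier (hypothetical illustration)."""

    def __init__(self, k=8):
        self.k = k
        self.n_seqs = 0                             # N: training sequences
        self.word_total = Counter()                 # n(w): sequences containing w
        self.taxon_sizes = Counter()                # M: sequences per taxon
        self.word_in_taxon = defaultdict(Counter)   # m(w): per-taxon word counts

    def train(self, labelled):
        """labelled: iterable of (taxon, sequence) pairs."""
        for taxon, seq in labelled:
            words = kmers(seq, self.k)
            self.n_seqs += 1
            self.taxon_sizes[taxon] += 1
            self.word_total.update(words)
            self.word_in_taxon[taxon].update(words)

    def _log_score(self, words, taxon):
        # Assumed smoothing: word prior (n(w) + 0.5) / (N + 1),
        # conditional probability (m(w) + prior) / (M + 1).
        m_total = self.taxon_sizes[taxon]
        score = 0.0
        for w in words:
            prior = (self.word_total[w] + 0.5) / (self.n_seqs + 1)
            score += math.log((self.word_in_taxon[taxon][w] + prior) / (m_total + 1))
        return score

    def _best(self, words):
        return max(self.taxon_sizes, key=lambda t: self._log_score(words, t))

    def classify(self, seq, n_boot=100, rng=None):
        """Return (taxon, bootstrap support as a fraction in [0, 1])."""
        rng = rng or random.Random(0)
        words = sorted(kmers(seq, self.k))
        if not words:
            raise ValueError("query is shorter than the word size k")
        best = self._best(words)
        # Each bootstrap trial re-classifies a random subset of the words.
        subset = max(1, len(words) // self.k)
        hits = sum(
            self._best(rng.choices(words, k=subset)) == best
            for _ in range(n_boot)
        )
        return best, hits / n_boot


if __name__ == "__main__":
    clf = KmerNaiveBayes(k=4)  # small k only because the toy data are short
    clf.train([
        ("A", "ACGTACGTACGTACGTACGT"),
        ("B", "TTGGCCAATTGGCCAATTGG"),
    ])
    taxon, support = clf.classify("ACGTACGTACGT")
    print(taxon, support)
```

Because shorter queries contain fewer words, each bootstrap trial rests on less evidence, which is consistent with the recommendation above to choose bootstrap support cutoffs according to query length. An assignment would be accepted at a given rank only when its support meets the chosen cutoff.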
Created:
2014-02-18



