How reliable is metabarcoding for pollen identification? An evaluation of different taxonomic assignment strategies by cross-validation.

Name: How reliable is metabarcoding for pollen identification? An evaluation of different taxonomic assignment strategies by cross-validation.
Creator: Dubois, Benjamin; Hautier, Louis; Mingeot, Dominique; San Martin, Gilles
Published: 2023-11-14 00:00:00
License: No description available

Figshare. Updated 2023-11-14; indexed 2026-04-08

Download link:

https://figshare.com/articles/dataset/How_reliable_is_metabarcoding_for_pollen_identification_An_evaluation_of_different_taxonomic_assignment_strategies_by_cross-validation_/23691579/1

Resource description:

Supplements, data, and code for the following paper: San Martin G., Hautier L., Mingeot D., Dubois B. (submitted to PeerJ) How reliable is metabarcoding for pollen identification? An evaluation of different taxonomic assignment strategies by cross-validation. Unzip and see the README file for details.

Summary of the paper:

Metabarcoding is a powerful tool, increasingly used in many disciplines of environmental sciences. However, to assign a taxon to a DNA sequence, bioinformaticians need to choose between different strategies or parameter values and these choices sometimes seem rather arbitrary. In this work, we present a case study on ITS2 and _rbcL_ databases used to identify pollen collected by bees in Belgium. We blasted a random sample of sequences from the reference database against the remainder of the database using different strategies and compared the known taxonomy with the predicted one. This _in silico_ Cross-Validation (CV) approach proved to be an easy yet powerful way to 1) assess the relative accuracy of taxonomic predictions, 2) define rules to discard dubious taxonomic assignments and 3) provide a more objective basis to choose the best strategy. We obtained the best results with the best blast hit (best bit score) rather than by selecting the majority taxon from the top 10 hits. The predictions were further improved by favouring the most frequent taxon among those with tied best bit scores. We obtained better results with databases containing the full sequences available on NCBI rather than restricting the sequences to the region amplified by the primers chosen in our study. Leaked CV showed that when the true sequence is present in the database, blast might still struggle to match the right taxon at the species level, particularly with _rbcL_. Classical 10-fold CV, where the true sequence is removed from the database, offers a different yet more realistic view of the true error rates.
Taxonomic predictions with this approach worked well up to the genus level, particularly for ITS2 (5-7% of errors). Using a database containing only the local flora of Belgium did not improve the predictions up to the genus level for local species and made them worse for foreign species. At the species level, using a database containing exclusively local species improved the predictions for local species by ~12% but the error rate remained rather high: 25% for ITS2 and 42% for _rbcL_. Foreign species performed worse even when using a world database (59-79% of errors). We used classification trees and GLMs to model the percentage of errors vs. identity and consensus scores and determine appropriate thresholds below which the taxonomic assignment should be discarded. This resulted in a significant reduction in prediction errors, but at the cost of a much higher proportion of unassigned sequences. Despite this stringent filtering, at least 1 in 5 sequences deemed suitable for species-level identification ultimately proved to be misidentified. An examination of the variability in prediction accuracy between plant families showed that _rbcL_ outperformed ITS2 for only 2 of the 27 families examined, and that the percentage of correct species-level assignments was much better for some families (e.g. 95% for Sapindaceae) than for others (e.g. 35% for Salicaceae).
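
The two assignment strategies compared in the summary can be illustrated with a minimal Python sketch. This is not the authors' code (see the README in the archive for that); the function names, the `(taxon, bit_score)` tuple layout, and the example hits are all assumptions made up for illustration. It contrasts "best bit score, with ties broken by the most frequent taxon" against "majority taxon among the top 10 hits".

```python
from collections import Counter


def best_hit_taxon(hits):
    """Pick the taxon with the best bit score.

    `hits` is a list of (taxon, bit_score) tuples for one query
    (hypothetical layout). When several hits share the best bit score,
    the most frequent taxon among those tied hits is favoured.
    """
    if not hits:
        return None
    top_score = max(score for _, score in hits)
    tied = [taxon for taxon, score in hits if score == top_score]
    # Counter.most_common keeps first-seen order for equal counts,
    # so any remaining tie goes to the earliest-listed hit.
    return Counter(tied).most_common(1)[0][0]


def majority_top_n_taxon(hits, n=10):
    """Pick the most frequent taxon among the n highest-scoring hits."""
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)[:n]
    return Counter(taxon for taxon, _ in ranked).most_common(1)[0][0]


# Invented hits for a single query sequence (not data from the paper).
hits = [
    ("Salix alba", 410.0),
    ("Salix caprea", 410.0),
    ("Salix alba", 410.0),
    ("Salix caprea", 402.0),
    ("Salix caprea", 398.0),
    ("Salix caprea", 395.0),
]

print(best_hit_taxon(hits))        # Salix alba
print(majority_top_n_taxon(hits))  # Salix caprea
```

With these invented hits the two strategies disagree: the tie-broken best hit favours the taxon that dominates the top bit score, while the top-10 majority vote is swayed by lower-scoring hits. In a cross-validation setting, each prediction would then be compared with the known taxonomy of the query sequence.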

Provider:

Dubois, Benjamin; Hautier, Louis; Mingeot, Dominique; San Martin, Gilles

Created:

2023-11-14
