Do pseudogenes pose a problem for metabarcoding marine animal communities?
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.1jwstqjvt
Description:
Because DNA metabarcoding typically employs sequence diversity among mitochondrial amplicons to estimate species composition, nuclear mitochondrial pseudogenes (NUMTs) can inflate diversity. This study quantifies the incidence and attributes of NUMTs derived from the 658 bp barcode region of cytochrome c oxidase I (COI) in 156 marine animal genomes. NUMTs were examined to ascertain if they could be recognized by their possession of indels or stop codons. In total, 309 NUMTs ≥ 150 bp were detected, with an average of 1.98 per species (range = 0–33) and a mean length of 391 ± 200 bp. Among this total, 75 (24.3%) lacked indels or stop codons. NUMTs appear to pose the greatest interpretational risk when short (< 313 bp) amplicons are used, such as in eDNA studies, dietary analyses, or processed fish identification. Employing the standard amplicon length (313 bp) for marine metabarcoding, NUMTs could potentially inflate the OTU count by 21% above the true species count while also raising intraspecific variation at COI by 15%. However, when both amplicon length and position are considered, the inflation in OTU counts and in barcode variation was just 9% and 10%, respectively, suggesting NUMTs will not seriously distort biodiversity assessments. There was a weak positive correlation between genome size and NUMT count but no variation among phyla or trophic groups. Until bioinformatic advances improve NUMT detection, the best defense involves targeting long amplicons and developing reference databases that include both mitochondrial sequences and their NUMT derivatives.
Methods
Data collection
We examined the incidence of COI NUMTs in the genomes of marine animals on the NCBI genome browser (Clark et al. 2016). To identify candidate genomes, we compared taxonomic names in the World Register of Marine Species (WoRMS; Horton et al. 2020) with the NCBI genome browser (https://www.ncbi.nlm.nih.gov/genome/browse). All genomes for marine invertebrates were downloaded together with those for at least one species per order of marine vertebrates, selected haphazardly. When more than one genome was available for a species, the reference genome (if available) or the most recent assembly was selected. In addition, we downloaded the COI sequence from the mitochondrial genome of each species and used AliView (Larsson 2014) to extract the 658 bp recovered by primers targeting the barcode region (Hebert et al. 2003). When available, the reference sequence for the full COI gene was also retained. When a COI sequence was unavailable on GenBank, the Barcode of Life Database (BOLD; Ratnasingham and Hebert 2007) was searched for a sequence.
NUMT search and identification
We conducted BLAST searches for mitochondrial COI against the genome sequence available for each species using the 658 bp barcode region as the query. Using Geneious Prime (version 2020.2.1), we conducted a BLASTn search with a maximum of 1000 hits and a maximum expectation value of e = 0.0001 to generate a list of hits. We excluded BLASTn hits < 150 bp in length, or those with both 100% coverage and ≥ 99.8% ID as these likely represented a mitochondrial sequence inadvertently included in the nuclear assembly.
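The exclusion rule described above can be sketched in Python. This is only an illustration, not the Geneious Prime workflow used in the study: the `BlastHit` record and its field names are hypothetical, and query coverage and identity are assumed to be expressed as percentages.

```python
from dataclasses import dataclass

MIN_HIT_LENGTH = 150    # bp; shorter hits are discarded
MITO_COVERAGE = 100.0   # % query coverage suggesting a mitochondrial copy
MITO_IDENTITY = 99.8    # % identity suggesting a mitochondrial copy


@dataclass
class BlastHit:
    """Hypothetical summary of one BLASTn hit (field names are assumptions)."""
    length: int             # hit length in bp
    query_coverage: float   # percent of the 658 bp query covered
    identity: float         # percent identity to the query


def is_putative_numt(hit: BlastHit) -> bool:
    """Keep a hit unless it is too short or looks like the mitochondrial copy."""
    if hit.length < MIN_HIT_LENGTH:
        return False
    looks_mitochondrial = (
        hit.query_coverage >= MITO_COVERAGE and hit.identity >= MITO_IDENTITY
    )
    return not looks_mitochondrial


# Invented example hits, for illustration only.
hits = [
    BlastHit(length=120, query_coverage=18.2, identity=91.0),   # too short
    BlastHit(length=658, query_coverage=100.0, identity=99.9),  # likely mtDNA
    BlastHit(length=402, query_coverage=61.1, identity=88.5),   # retained
]
putative_numts = [h for h in hits if is_putative_numt(h)]
```

Note that a full-length hit is excluded only when both conditions hold, so a 658 bp hit with lower identity would still be treated as a putative NUMT.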
The remaining hits were considered putative COI NUMTs and summary information (hit length, GC content, query coverage, percent similarity, e-value) was exported to Excel (Supp. Data Table S1). Using the BLASTn alignments in Geneious Prime (version 2020.2.1), each hit was individually aligned with the mitochondrial COI sequence for that species to visually search for any insertions and deletions. Each sequence was then translated using the appropriate mitochondrial code to determine if premature stop codons were present. To determine the correct reading frame, we tested each codon position until no premature stop codons appeared in the COI reference sequence. The presence of indels or premature stop codons (IPSCs) at any position along the sequence was recorded for all hits ≥ 150 bp.
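The reading-frame and stop-codon check might look roughly like the following. It is a minimal sketch rather than the procedure carried out in Geneious Prime: the function names are hypothetical, only NCBI translation tables 2 (vertebrate mitochondrial) and 5 (invertebrate mitochondrial) are covered, and the hit is assumed to align to the reference without gaps (indels were assessed separately by visual inspection). Because the barcode region is internal to COI, any stop codon in frame is treated as premature.

```python
# Stop codons under two NCBI mitochondrial genetic codes.
STOP_CODONS = {
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}


def has_stop_codon(seq: str, frame: int, table: int) -> bool:
    """Return True if any complete codon read from `frame` is a stop codon."""
    stops = STOP_CODONS[table]
    seq = seq.upper()
    return any(seq[i:i + 3] in stops for i in range(frame, len(seq) - 2, 3))


def reference_frame(reference: str, table: int) -> int:
    """Try frames 0, 1, 2 and return the first with no stop in the reference."""
    for frame in range(3):
        if not has_stop_codon(reference, frame, table):
            return frame
    raise ValueError("reference contains stop codons in all three frames")


def numt_has_stop(hit_seq: str, hit_start: int, reference: str, table: int) -> bool:
    """Check a gapless hit for stop codons in the reference's reading frame.

    `hit_start` is the 0-based reference position where the hit begins.
    """
    hit_frame = (reference_frame(reference, table) - hit_start) % 3
    return has_stop_codon(hit_seq, hit_frame, table)


# Toy sequences, invented for illustration.
reference = "TTAGGCATGATTGGA"
intact_hit = "GCATGATTGGA"    # begins at reference position 4
mutated_hit = "GCATGTAAGGA"   # same region, with an in-frame TAA
```

On very short sequences more than one frame can be free of stop codons, so this approach is only informative for reference sequences long enough to make the correct frame unambiguous.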
Since using a longer COI query length could reveal additional NUMTs beyond the 658 bp barcode region, we conducted a second BLASTn search among the invertebrates in our dataset using the full-length (1500 bp) COI sequence when available (Supp. Figure S1). We used the same BLASTn parameters and strategy for identifying IPSCs as for the 658 bp query and retained all hits ≥ 150 bp. This analysis made it possible to ascertain if certain regions of COI were more prone to incorporation into NUMTs by mapping hits to the reference COI sequence and then quantifying the coverage at each nucleotide position. We then plotted the coverage for all species in a particular phylum to determine the frequency with which each nucleotide position of COI appeared in a NUMT.
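Per-position coverage of the kind described above could be tallied as follows. This is a sketch under stated assumptions: hits are represented as hypothetical (start, end) pairs in 1-based, inclusive coordinates on the reference COI sequence, and the example coordinates are invented.

```python
def position_coverage(hits, reference_length):
    """Count how many hits span each reference position.

    `hits` is an iterable of (start, end) pairs, 1-based and inclusive.
    Returns a list whose index 0 corresponds to reference position 1.
    """
    coverage = [0] * reference_length
    for start, end in hits:
        for pos in range(start - 1, end):
            coverage[pos] += 1
    return coverage


def coverage_frequency(coverage, n_hits):
    """Convert raw counts to the fraction of hits covering each position."""
    return [count / n_hits for count in coverage]


# Invented hit coordinates on a 1500 bp reference, for illustration only.
example_hits = [(1, 200), (150, 400), (350, 658)]
counts = position_coverage(example_hits, 1500)
frequencies = coverage_frequency(counts, len(example_hits))
```

Pooling the hit lists of all species in a phylum before calling `position_coverage` would give the phylum-level profile that the text describes plotting.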
NUMT diagnosis
In addition to examining all NUMTs ≥ 150 bp for diagnostic features, we quantified the incidence of NUMTs that presented an interpretational threat by ascertaining the proportion of hits lacking diagnostic features at four sequence lengths (150, 313, 500, 650 bp) commonly used in marine barcoding, metabarcoding, and eDNA approaches (Hebert, Ratnasingham, & deWaard, 2003; Leray et al., 2013; Ratnasingham & Hebert, 2013; Shokralla et al., 2015; van der Loos & Nijland, 2020; S. Zhang, Zhao, & Yao, 2020). HTS platforms have the potential to recover all NUMTs, but only those with IPSCs in the target region will be diagnosed. For instance, platforms that generate 150 bp reads will capture all NUMTs ≥ 150 bp, but these will only be recognizable as NUMTs if they possess IPSCs within the first 150 bp. Using results from the 658 bp query, we therefore considered hits diagnosable if they contained IPSCs in the target sequence region (i.e., the first 150, 313, 500, or 650 bp). The mean number of both diagnosable and non-diagnosable hits per species was compared among the four sequence length categories using a Friedman test with a Bonferroni correction of k = 2 and adjusted α of 0.025 (see Supp. Info Table S3 for a summary of statistical tests used). The median total number of hits per species was compared to the median number of undiagnosable hits per species within each length category using sign tests (Bonferroni correction: k = 4, adjusted α = 0.0125).
In addition, we examined the impact of NUMT divergence values on metabarcoding results, employing a fixed sequence divergence threshold for delineating OTUs. We chose a 3% divergence threshold because of its widespread use in marine metazoan surveys (Leray and Knowlton 2015, 2017, Cahill et al. 2018). However, we also report results with 2% and 4% thresholds as some studies have employed them. For instance, the BIN system uses 2.2% for initial clustering as part of the RESL clustering approach (Ratnasingham and Hebert 2013), and 4% is sometimes used as an upper bound in Bayesian approaches (e.g., Hao et al. 2011, Leray and Knowlton 2017). Divergence values are only important for non-diagnosable NUMTs that are not excluded from downstream analysis. We therefore ascertained the proportion of undiagnosable NUMTs that would either inflate the OTU count (>2, 3, or 4% divergence) or intraspecific barcode variation (<2, 3, or 4% divergence).
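The two classification steps in this subsection can be illustrated with a short Python sketch. The function names are hypothetical, IPSC positions are assumed to be 1-based offsets from the start of the target region, and the handling of a divergence exactly equal to the threshold (assigned here to intraspecific variation) is an assumption, since the text specifies only the strict inequalities.

```python
TARGET_LENGTHS = (150, 313, 500, 650)   # bp, the four read lengths considered
THRESHOLDS = (2.0, 3.0, 4.0)            # % divergence used to delineate OTUs


def is_diagnosable(ipsc_positions, target_length):
    """True if at least one IPSC falls within the first `target_length` bp."""
    return any(pos <= target_length for pos in ipsc_positions)


def effect_of_undiagnosable(divergence_pct, threshold_pct=3.0):
    """Classify how an undiagnosable NUMT would distort results."""
    if divergence_pct > threshold_pct:
        return "inflates OTU count"
    return "inflates intraspecific variation"


# A hypothetical NUMT whose only IPSC lies at position 420.
diagnosable_by_length = {n: is_diagnosable([420], n) for n in TARGET_LENGTHS}

# A hypothetical undiagnosable NUMT that is 2.6% divergent from its source COI.
effect_by_threshold = {t: effect_of_undiagnosable(2.6, t) for t in THRESHOLDS}
```

In this toy case the NUMT would go unrecognized in 150 bp and 313 bp reads but would be flagged at 500 bp and 650 bp, which is the length dependence the analysis is designed to measure.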
Because the 313 bp Leray fragment (Leray et al. 2013) is commonly used in marine metabarcoding applications, we further examined the number of sequences likely to appear in studies targeting this fragment when both the length and position of the NUMT in the barcode region are considered. The Leray fragment is located near the 3’ terminus of the 658 bp barcode amplified by Folmer (Folmer et al. 1994) or Geller (Geller et al. 2013) primers. Because it begins at approximately the 345th bp of COI, we reasoned that a NUMT is likely to be amplified by the Leray primers if: 1) the starting base pair of the NUMT is at least 10 bp before the 345th bp of COI to allow binding of the forward primer; 2) it is long enough to span the Leray fragment (313 bp); and 3) it includes an additional 10 bp to allow binding of the reverse primer. Because our query length was restricted to 658 bp, we could not ascertain if a hit meeting the first two criteria also met the third criterion, but this should often have been the case. We quantified the number of NUMTs that met these criteria and that lacked an IPSC to determine the proportion that could pose a problem for studies targeting this region.
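One possible reading of the first two criteria is sketched below in Python. The coordinates are assumed to be 1-based, inclusive positions on the 658 bp barcode query, the constant and function names are hypothetical, and "long enough to span the Leray fragment" is interpreted here as the hit reaching at least the last base of a 313 bp fragment starting at position 345. The third criterion is not evaluated because, as noted above, it cannot be assessed with a 658 bp query.

```python
LERAY_START = 345     # approximate first bp of the Leray fragment
LERAY_LENGTH = 313    # bp
PRIMER_MARGIN = 10    # bp upstream needed for forward-primer binding


def may_amplify_with_leray(numt_start: int, numt_end: int) -> bool:
    """Apply criteria 1 and 2 to a hit on the 658 bp barcode query."""
    covers_forward_site = numt_start <= LERAY_START - PRIMER_MARGIN
    spans_fragment = numt_end >= LERAY_START + LERAY_LENGTH - 1
    return covers_forward_site and spans_fragment


def is_problematic(numt_start: int, numt_end: int, has_ipsc: bool) -> bool:
    """A hit is a concern if it may amplify and carries no diagnostic IPSC."""
    return may_amplify_with_leray(numt_start, numt_end) and not has_ipsc


# Invented hit coordinates, for illustration only.
full_length_clean = is_problematic(1, 658, has_ipsc=False)
starts_too_late = is_problematic(340, 658, has_ipsc=False)
ends_too_early = is_problematic(1, 500, has_ipsc=False)
```

Because the fragment start is only approximate, the exact cut-offs here should be treated as adjustable parameters rather than fixed values.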
Patterns of NUMT abundance among species
We examined the relationship between the number of hits (greater than or equal to 150 bp) and genome size, the quality of the assembly (contig N50), and genome coverage reported on NCBI using Spearman’s rank correlations (Bonferroni correction: k = 3, adjusted α = 0.0167). In addition, we examined if species in certain phyla or those with differing ecological traits were more likely to possess NUMTs. For these comparisons, we used results from the COI barcode query length (658 bp) and included all hits greater than or equal to 150 bp. We compared the average number of total hits per species among phyla, the average number of undiagnosable hits among phyla, and the average number of total hits among trophic groups using Kruskal-Wallis rank sum tests (Bonferroni correction: k = 3; adjusted α = 0.0167).
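For readers who want to reproduce the logic of these comparisons, the sketch below shows the Bonferroni adjustment and a plain-Python Spearman rank correlation coefficient. It is not the software used in the study: a family-wise α of 0.05 is assumed (it is consistent with the reported adjusted value for k = 3), the function names are hypothetical, the example numbers are invented, and no p-value is computed.

```python
def bonferroni_alpha(family_alpha: float, k: int) -> float:
    """Per-test significance level for k comparisons."""
    return family_alpha / k


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


adjusted_alpha = bonferroni_alpha(0.05, 3)   # approximately 0.0167

# Invented genome sizes (Mb) and NUMT counts for five hypothetical species.
genome_size_mb = [120, 450, 800, 1500, 2300]
numt_counts = [0, 1, 1, 4, 3]
rho = spearman_rho(genome_size_mb, numt_counts)
```

In practice a statistics package would also supply the p-value to compare against the adjusted α.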
Ecological information was primarily compiled from the Encyclopedia of Life (http://eol.org; accessed 12 Sept 2020), but additional information was obtained from the primary literature to fill gaps (see Supp. Data Table S2 for specific references). We recognized six trophic categories (predator/carnivore, grazer/herbivore, parasite, suspension feeder, omnivore, other) on the basis of adult feeding habits. The ‘suspension feeder’ category included passive suspension feeders, active filter feeders, and mucous net feeders. The ‘omnivore’ category included species with more than one equally prevalent trophic level or guild. ‘Other’ included a chemosymbiotroph, a surface deposit feeder, and two detritivores.
Created:
2023-08-18



