Updating splits, lumps, and shuffles: Reconciling GenBank names with standardized avian taxonomies

NIAID Data Ecosystem, indexed 2026-03-13
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.gtht76hqf
Resource description:
Abstract

Biodiversity research has advanced by testing expectations of ecological and evolutionary hypotheses through the linking of large-scale genetic, distributional, and trait datasets. The rise of molecular systematics over the past 30 years has resulted in a wealth of DNA sequences from around the globe. Yet advances in molecular systematics have also created taxonomic instability, as new estimates of evolutionary relationships and interpretations of species limits have required widespread scientific name changes. Taxonomic instability, colloquially “splits, lumps, and shuffles,” presents logistical challenges to large-scale biodiversity research because (1) the same species or sets of populations may be listed under different names in different data sources, or (2) the same name may apply to different sets of populations representing different taxonomic concepts. Consequently, distributional and trait data are often difficult to link directly to primary DNA sequence data without extensive and time-consuming curation. Here, we present RANT: Reconciliation of Avian NCBI Taxonomy. RANT applies taxonomic reconciliation to standardize avian taxon names in use in NCBI GenBank, a primary source of genetic data, to a widely used and regularly updated avian taxonomy: eBird/Clements. Of 14,341 avian species/subspecies names in GenBank, 11,031 directly matched an eBird/Clements name; these link to more than 6 million nucleotide sequences. For the remaining unmatched avian names in GenBank, we used Avibase’s system of taxonomic concepts, taxonomic descriptions in Cornell’s Birds of the World, and DNA sequence metadata to identify corresponding eBird/Clements names. Reconciled names linked to more than 600,000 nucleotide sequences, ~9% of all avian sequences on GenBank. Nearly 10% of eBird/Clements names had nucleotide sequences listed under 2 or more GenBank names.
Our taxonomic reconciliation is a first step towards rigorous and open-source curation of avian GenBank sequences and is available at GitHub, where it can be updated to correspond to future annual eBird/Clements taxonomic updates.

Methods

Taxonomic reconciliation

We downloaded all names from the NCBI Taxonomy database (Schoch et al., 2020) that descended from “Aves” (TaxID: 8782) on 3 May 2020 (Data Repository D2). From this list, we extracted all species and subspecies names as well as their NCBI Taxonomy ID (TaxID) numbers. We then ran a custom Perl script (Data Repository D3) to exactly match binomial (genus, species) and trinomial (genus, species, subspecies) names from NCBI Taxonomy to the names recognized by the eBird/Clements v2019 Integrated Checklist (August 2019; Data Repository D4).

For each NCBI Taxonomy name without an exact match, we then identified the equivalent eBird/Clements species or subspecies. We first searched for names in Avibase (Lepage et al., 2014). However, Avibase’s search function currently facilitates only exact matches to taxonomies it implements. For names that were not an exact match to an Avibase taxonomic concept, we ran web searches (Google), which often identified minor spelling differences, consulted Cornell’s Birds of the World Online (https://birdsoftheworld.org), and consulted relevant literature, often the papers that first published those sequence data.

We classified nine categories of naming mismatches resulting from discrepancies between GenBank and eBird/Clements names: split, lump, shuffle, new, spelling, hybrid, extinct, domesticated, and unidentified (Table 2).
Split is a name that corresponds to a subspecies rank in GenBank, but a species rank in eBird/Clements. For example, the GenBank subspecies name Otus megalotis everetti (TaxID: 56274) corresponds to the species name Otus everetti in eBird/Clements.
Lump is a name that corresponds to a species rank in GenBank, but a subspecies rank in eBird/Clements. For example, the GenBank name Megascops colombianus (TaxID: 1740167) corresponds to Megascops ingens colombianus in eBird/Clements.
Shuffle is a taxon that has an equivalent rank in GenBank and eBird/Clements, but different name usage. Most often shuffles stem from changes in genera, but a few species epithets have changed because of new evidence regarding nomenclature priority. For example, the GenBank name Mimizuku gurneyi (TaxID: 56287) corresponds to Otus gurneyi in eBird/Clements, reflecting a change in the generic name.
New is a species or subspecies that was undescribed when its sequences were initially uploaded to GenBank. To preserve nomenclature priority, GenBank avoids unpublished or in-press names of undescribed taxa, instead assigning an informal placeholder name. Typically, the placeholder name consists of the genus, the data uploaders' initials, and the year of first upload. For example, Megascops_sp._SMD-2015 (TaxID: 1740173) corresponds to the Santa Marta Screech-Owl, Megascops gilesi, Krabbe, 2017.
Spelling is a taxon that has an equivalent name in GenBank and eBird/Clements, but with a slightly different spelling. For example, the GenBank name Glaucidium nanum (TaxID: 126809) corresponds to the eBird/Clements name Glaucidium nana.
Hybrid is a hybrid individual, usually identified in GenBank by a name comprising the putative parental species separated by a cross “x”. An example is the GenBank name Strix occidentalis x Strix varia. Hybrids were not reconciled to eBird/Clements names, although eBird taxonomy does include and organize names for some frequent avian hybrid parental combinations.
Extinct is an extinct taxon that is not regulated by eBird/Clements because it was not documented in the modern era.
For example, the elephant bird Aepyornis maximus (TaxID: 748142) is known from Holocene bones and eggshell materials that have yielded DNA sequences, but this name is not regulated by eBird/Clements.
Domesticated is a domesticated breed or line. For example, GenBank has a listing for the domesticated “Society Finch” as Lonchura striata domestica (TaxID: 299123), but in eBird/Clements it refers to Lonchura striata because domesticated forms are not generally considered subspecies.
Finally, Unidentified refers to TaxIDs to which we were unable to assign a species name. These were generally samples not identified to species, or environmental DNA samples.

We summarized the total number and proportion of reconciled GenBank TaxIDs by bird order and, within the largest bird order, Passeriformes, by family. We also summarized the number of GenBank nucleotide sequences and number of reconciliations for each IUCN conservation status category. A taxon that did not have a direct match to an IUCN name was placed under “Not Assessed”.

GenBank sequences associated with avian names

We tallied the number of core nucleotide sequences in GenBank associated with each taxonomic ID by downloading the “nucl_gb.accession2taxid” file on 2 November 2020 (Data Repository D5). This file lists the accession number for each sequence in the GenBank nucleotide database and its corresponding taxonomic ID number. Using this file, we wrote a Perl script (Data Repository D6) to count the number of nucleotide sequences associated with each avian taxonomic ID.

To obtain counts of the number of runs in the NCBI Sequence Read Archive (SRA) associated with each bird species, we downloaded the “RunInfo” for the SRA runs (“SraRunInfo.csv”) within “Aves” on 1 August 2021 (Data Repository D7).
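The per-TaxID nucleotide tally described in this section can be sketched as follows. This is an illustrative Python sketch, not the authors' Perl script (Data Repository D6). It assumes the accession-to-TaxID file is tab-separated with a header row and the TaxID in the third column; the function name `count_sequences_per_taxid` and the sample accessions are hypothetical, while the two avian TaxIDs are taken from the examples in the text.

```python
from collections import Counter
from typing import Iterable, Set


def count_sequences_per_taxid(lines: Iterable[str], avian_taxids: Set[str]) -> Counter:
    """Count accession rows per TaxID, keeping only TaxIDs in avian_taxids.

    Assumed layout (tab-separated): accession, accession.version, taxid, gi.
    """
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        if i == 0 and fields[0] == "accession":
            continue  # skip the header row
        if len(fields) < 3:
            continue  # skip malformed or empty rows
        taxid = fields[2]
        if taxid in avian_taxids:
            counts[taxid] += 1
    return counts


# Made-up accession rows for illustration; 9606 is a non-avian TaxID (human).
sample_rows = [
    "accession\taccession.version\ttaxid\tgi\n",
    "XX000001\tXX000001.1\t56274\t1\n",
    "XX000002\tXX000002.1\t56274\t2\n",
    "XX000003\tXX000003.1\t56287\t3\n",
    "XX000004\tXX000004.1\t9606\t4\n",
]

tally = count_sequences_per_taxid(sample_rows, {"56274", "56287"})
for taxid, n in sorted(tally.items()):
    print(taxid, n)
```

Because the function reads line by line, the same call works on an open file handle, which matters for a file with one row per GenBank nucleotide accession.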
To obtain counts of the number of genome sequences in GenBank associated with each name, we downloaded a summary of the NCBI Genome files (“genome_result.txt”) within “Aves” from NCBI on 5 September 2021 (Data Repository D8).

Linking eBird/Clements names to geographic realms

For TaxIDs that were successfully assigned to eBird/Clements species names (either by direct name match or taxonomic reconciliation), we delimited their geographic realms using the associated IOC breeding ranges (eight terrestrial realms and four oceanic realms). We used IOC rather than eBird/Clements geographic information because eBird/Clements does not summarize species occurrence by geographic realm. We also manually assigned geographic realms for species without range information available in the IOC v10.1 checklist (master_ioc_list_v10.1.xlsx). We defined species that occur in only one realm as realm endemics, and species that occur in two or more realms as widespread. We then summarized the number of reconciliations and the number of GenBank nucleotide sequences for each realm and for widespread species.

Linking eBird/Clements names to other databases

We used audio data as an example to examine the extent to which name-reconciled GenBank sequences apply to large avian comparative databases, such as Macaulay Library and Xeno-canto. Since Macaulay Library uses eBird/Clements taxonomy for its bird images, audio, and video, we can readily link these media resources to the GenBank nucleotide data under the same eBird/Clements names. We downloaded a summary of available audio data (April 2021) from Macaulay Library (https://www.macaulaylibrary.org/resources/media-target-species/; Data Repository D9). We also examined Xeno-canto, a global avian vocalization database, which uses the IOC taxonomy. To match Xeno-canto’s 10,909 avian names to eBird/Clements names, we set aside the species with a direct name match and then reconciled the remaining names using Avibase taxonomic concepts.
Lastly, we summed the number of Xeno-canto sound recordings (October 2020; https://www.xeno-canto.org/collection/species/all; Data Repository D10) under the same eBird/Clements name. For example, the Xeno-canto name Colinus leucopogon had 26 sound recordings and Colinus cristatus had 57, but the eBird/Clements name C. cristatus would have 83 (26 + 57), because C. leucopogon is treated as a subspecies of C. cristatus by eBird/Clements.
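The pooling step in the Colinus example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' workflow: the function name `pool_recordings` and the dictionary `xc_to_clements` are hypothetical, and any name absent from the mapping is assumed to be a direct match to eBird/Clements.

```python
from collections import defaultdict
from typing import Dict


def pool_recordings(recordings: Dict[str, int], name_map: Dict[str, str]) -> Dict[str, int]:
    """Sum recording counts under the reconciled species name.

    Names missing from name_map are assumed to match eBird/Clements directly.
    """
    pooled: Dict[str, int] = defaultdict(int)
    for source_name, n in recordings.items():
        target_name = name_map.get(source_name, source_name)
        pooled[target_name] += n
    return dict(pooled)


# Counts from the worked example in the text.
xc_counts = {"Colinus leucopogon": 26, "Colinus cristatus": 57}

# Hypothetical reconciliation table: Xeno-canto (IOC) name -> eBird/Clements species.
xc_to_clements = {"Colinus leucopogon": "Colinus cristatus"}

pooled = pool_recordings(xc_counts, xc_to_clements)
print(pooled)  # {'Colinus cristatus': 83}
```

Keeping the reconciliation table separate from the counts means the same counts can be re-pooled when a later eBird/Clements update changes the mapping.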
Created:
2022-08-25