A latitudinal gradient of reference genomes

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.2v6wwpzxh

Description:

Global inequality rooted in legacies of colonialism and uneven development can lead to systematic biases in scientific knowledge. In ecology and evolutionary biology, findings, funding, and research effort are disproportionately concentrated at high latitudes, while biological diversity is concentrated at low latitudes. This discrepancy may have a particular influence in fields like phylogeography, molecular ecology, and conservation genetics, where the rise of genomics has increased the cost and technical expertise required to apply state-of-the-art methods. Here, we ask whether a fundamental biogeographic pattern – the latitudinal gradient of species richness in tetrapods – is reflected in available reference genomes, an important data resource for various applications of molecular tools for biodiversity research and conservation. We also ask whether sequencing approaches differ between the Global South and Global North, reviewing the last five years of conservation genetics research in four leading journals. We find that extant reference genomes are scarce relative to species richness at low latitudes, and that reduced-representation and whole-genome sequencing are disproportionately applied to taxa in the Global North. We conclude with recommendations to close this gap and improve international collaborations in biodiversity genomics.

Methods

We used the National Center for Biotechnology Information (NCBI) Datasets command-line tools v.16.19.0 (O’Leary et al. 2024) to download taxonomy metadata for the subset of species with an assembled reference genome in the following taxa: birds (Class: Aves), mammals (Class: Mammalia), squamates (Order: Squamata), amphibians (Class: Amphibia), turtles (Order: Testudines), crocodilians (Order: Crocodilia) and tuataras (Order: Rhynchocephalia).
We selected these groups, which together comprise extant tetrapods, to provide a snapshot of animal diversity in relatively well-studied clades with different ecologies and evolutionary histories, while restricting the total dataset to a computationally manageable size. From this initial list we retained species with an exact match to the Global Biodiversity Information Facility’s (GBIF) Backbone Taxonomy using rgbif v.3.8.0 (Chamberlain et al. 2024) and downloaded all observations of each species backed by georeferenced voucher specimens in natural history museum collections (NHCs), excluding those without coordinates and those flagged for geospatial issues (n = 3,006,946). We repeated this process for all species in each higher-level taxon represented in our list of reference genomes (i.e., downloaded metadata for all georeferenced tetrapod specimens on GBIF; n = 9,303,258). DOIs for each download are available in the References section below (GBIF.org 2024a; GBIF.org 2024b; GBIF.org 2024c; GBIF.org 2024d; GBIF.org 2024e; GBIF.org 2024f; GBIF.org 2024g; GBIF.org 2024h). Filtering these aggregated datasets to contain only species with 10 or more specimen records, we generated a convex hull polygon for each species as a coarse approximation of its geographic distribution using the R package sf v.1.0-16 (Pebesma 2018; Pebesma & Bivand 2023). Overlaying these on a shapefile of Earth’s landmasses from rnaturalearth v.1.0.1 (Massicotte & South 2024), we calculated species richness as the number of overlapping convex hulls in 2-degree by 2-degree grid cells, standardizing this value by subtracting the observed mean global species richness and dividing by its standard deviation. We subtracted the number of species with reference genomes from total species richness to determine the regions with the largest representation gap in genomic resources, again standardizing the difference.
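The standardization and representation-gap steps described above can be sketched in a few lines of Python. This is only an illustration, not the authors' R code: the per-cell counts are invented toy values, the names `standardize`, `total_richness`, and `genome_richness` are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: one entry per 2-degree grid cell.
total_richness = [120, 80, 40, 10]   # all species whose convex hull overlaps the cell
genome_richness = [6, 8, 10, 4]      # the subset of those with a reference genome

# Standardized species richness per cell.
richness_z = standardize(total_richness)

# Representation gap: total richness minus richness of species with genomes,
# standardized in the same way.
gap = [t - g for t, g in zip(total_richness, genome_richness)]
gap_z = standardize(gap)

# Index of the cell with the largest standardized gap.
largest_gap_cell = max(range(len(gap_z)), key=gap_z.__getitem__)
```

Because both quantities are expressed in standard-deviation units, cells can be compared on a common scale even though raw richness differs greatly between regions.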
To assess the significance and slope of a correlation between species richness and the absolute value (or modulus) of latitude in decimal degrees, we performed simple linear regressions in R v.4.4.0 (R Core Team 2024), analyzing species with reference genomes and our full dataset separately. To evaluate how the geography of authorship might impact the sequencing strategy of studies in conservation biology, we performed a restricted Web of Science literature search on 29 June 2024 for English-language conservation genetics papers published in the last five years in the journals Conservation Genetics, Molecular Ecology, Journal of Heredity, and Conservation Biology, selected for frequently publishing empirical work on non-model organisms. We used the queries 'SO="Conservation Genetics"' and 'SO=("Molecular Ecology" OR "Journal of Heredity" OR "Conservation Biology") AND (TS="Conservation Genet*" OR KP="Conservation Genet*" OR TI="Conservation Genet*")', excluding reviews, genome announcements, meta-analyses, preprints, and studies that were purely simulations. Our criteria aimed to achieve a tractable sample size for careful study (<1000 papers) while covering the period in which whole-genome sequencing (WGS) became commonly used for the conservation genetics of non-model organisms (Fuentes-Pardo & Ruzzante 2017; Hohenlohe et al. 2021). We then manually reviewed each study, first assigning the home institution of its first and last author to the Global North, Global South or both (i.e., joint affiliations) using the 2024 UN Trade and Development Classifications. Because the number of middle authors varied widely across our sample, we assessed their affiliations on a binary basis, indicating only whether a contributor from an institution in the Global South was present outside of the lead and senior positions. Synthesizing these data, we assigned papers to mutually exclusive groups based on whether they included one or more Global South authors or only Global North authors.
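The grouping rule for authorship can be written out as a small function. This is a hedged sketch rather than the authors' procedure: the function name `author_group`, the labels `"north"`, `"south"`, and `"both"`, and the returned group names are all hypothetical, and treating a joint ("both") affiliation as including a Global South author is an assumption not stated explicitly in the text.

```python
def author_group(first_author, last_author, middle_has_south):
    """Assign a paper to one of two mutually exclusive authorship groups.

    first_author, last_author: "north", "south", or "both" (joint affiliation).
    middle_has_south: True if any middle author is at a Global South institution.
    """
    valid = {"north", "south", "both"}
    if first_author not in valid or last_author not in valid:
        raise ValueError("affiliation must be 'north', 'south', or 'both'")

    # Assumption: a joint affiliation counts as a Global South affiliation.
    lead_or_senior_south = (
        first_author in ("south", "both") or last_author in ("south", "both")
    )
    if lead_or_senior_south or middle_has_south:
        return "global_south_included"
    return "global_north_only"


# Example: Global North lead and senior authors, one Global South middle author.
example = author_group("north", "north", middle_has_south=True)
```

Recording middle authors as a single boolean mirrors the binary assessment described above and keeps the rule independent of how many middle authors a paper has.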
Next, we categorized each study’s sequencing approach as reduced representation, WGS, Sanger sequencing, microsatellites, or other, and described its overall focus using tiered categories based on discussion in Bertola et al. (2024). These tiers were: 1) Taxonomy / systematics, identification, or sexing; 2) Phylogeography / population genetic structure, estimating genetic diversity, and inferring demographic history; and 3) Detecting outlier loci, quantifying runs of homozygosity, and evaluating adaptive potential. When studies employed more than one sequencing approach or addressed goals belonging to multiple tiers, we assigned them to a single category based on their most data-intensive method or question. To explore geographic patterns in sequencing effort, we assessed whether each study’s taxonomic sampling included 1) at least one species distributed in the Global South and 2) at least one species distributed in the Global North. Because some studies included multiple taxa and some species are broadly distributed or migrate between regions, these categories were not mutually exclusive. To evaluate whether geographic representation in conservation genetics changed over the period covered by our review, we performed logistic regression using the stats package in R v.4.4.0, treating the presence or absence of an author from the Global South as a binary outcome variable and year as the sole independent variable.
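The logistic regression of Global South authorship on publication year can be illustrated with a minimal pure-Python fit. This sketch swaps in a hand-written Newton-Raphson solver in place of the R stats package named above, centers year on its mean for numerical stability (an added choice), and uses invented toy data. The function name `fit_logistic` and all values are hypothetical.

```python
import math


def fit_logistic(years, outcomes, iterations=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * (year - mean_year))))
    by Newton-Raphson on the log-likelihood. Returns (b0, b1, mean_year)."""
    mean_year = sum(years) / len(years)
    x = [yr - mean_year for yr in years]  # centered predictor
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to b0
            g1 += (yi - p) * xi     # gradient with respect to b1
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1, mean_year


# Hypothetical toy data: four papers per year; 1 = at least one Global South author.
years = [2019] * 4 + [2020] * 4 + [2021] * 4 + [2022] * 4 + [2023] * 4
outcomes = [0, 0, 0, 1,  0, 0, 1, 0,  0, 1, 1, 0,  1, 1, 0, 1,  1, 1, 1, 0]

b0, b1, mean_year = fit_logistic(years, outcomes)

# Fitted probabilities for each paper under the estimated coefficients.
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * (yr - mean_year)))) for yr in years]
```

A positive `b1` would indicate that the odds of a paper including a Global South author rise with publication year; in practice a standard error or likelihood-ratio test would be needed to judge significance, which this sketch does not compute.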

Created:

2025-08-21
