Analysis of biodiversity data suggests that mammal species are hidden in predictable places
Source: NIAID Data Ecosystem. Indexed: 2026-03-13
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.b2rbnzshp
Description:
Research in the biological sciences is hampered by the Linnean shortfall, which describes the number of hidden species that are suspected of existing without formal species description. Using machine learning and species delimitation methods, we built a predictive model that incorporates some 5.0 × 10^5 data points for 117 species traits, 3.3 × 10^6 occurrence records, and 9.1 × 10^5 gene sequences from 4,310 recognized species of mammals. Delimitation results suggest that there are hundreds of undescribed species in class Mammalia. Predictive modeling indicates that most of these hidden species will be found in small-bodied taxa with large ranges characterized by high variability in temperature and precipitation. As demonstrated by a quantitative analysis of the literature, such taxa have long been the focus of taxonomic research. This analysis supports taxonomic hypotheses regarding where undescribed diversity is likely to be found and highlights the need for investment in taxonomic research to overcome the Linnean shortfall.
Methods
Genetic data: We downloaded all available mammalian DNA sequences for the mitochondrial genes cytochrome-c oxidase I (COI) and cytochrome-b (cytb) from the NIH genetic sequence database, GenBank. For each gene, we grouped sequences by species and then manually checked all species records for errors (e.g., subspecies, duplicates, and extinct species). To ensure standardization across groups, we updated all sequence taxonomy to reflect that of the Mammal Diversity Database (MDD) published by the American Society of Mammalogists. Following taxonomic standardization, we grouped sequence records for each gene by family and generated multiple sequence alignments for COI and cytb independently using MUSCLE v3.5. We then visually inspected each family-level alignment for gaps and removed problematic sequences that caused severe gaps or misalignment and could not be resolved through reverse complementation or manual correction.
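The grouping step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the synonym table, the family lookup, and the placeholder taxa ("Aus bus", "Xus yus") are all hypothetical, and the alignment itself (done with MUSCLE in the study) is outside the scope of the example. The sketch only shows how trinomials might be reduced to species names, how names might be mapped to an accepted taxonomy, and how records could then be binned by family and written out as FASTA text.

```python
from collections import defaultdict

# Hypothetical lookup tables standing in for a curated taxonomy such as the MDD.
SYNONYMS = {"Aus dus": "Aus bus"}                      # outdated name -> accepted name
FAMILY_OF = {"Aus bus": "Aidae", "Xus yus": "Xidae"}   # accepted name -> family


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def standardize(name, synonyms):
    """Reduce a trinomial to its binomial, then map it to the accepted name."""
    binomial = " ".join(name.split()[:2])
    return synonyms.get(binomial, binomial)


def group_by_family(records, synonyms, family_of):
    """Bin (record_id, name, sequence) tuples by family.

    Records whose standardized name has no family entry are skipped;
    in practice they would be set aside for manual review.
    """
    groups = defaultdict(list)
    for record_id, name, seq in records:
        species = standardize(name, synonyms)
        family = family_of.get(species)
        if family is None:
            continue
        groups[family].append((record_id, species, seq))
    return dict(groups)


def to_fasta(entries):
    """Format grouped entries as FASTA text, one file's worth per family."""
    return "".join(
        f">{record_id}|{species.replace(' ', '_')}\n{seq}\n"
        for record_id, species, seq in entries
    )


# Toy records: a subspecies, an outdated synonym, an unknown name, a second family.
records = [
    ("r1", "Aus bus cus", "ATGC"),
    ("r2", "Aus dus", "ATGA"),
    ("r3", "Zus zus", "ATGG"),
    ("r4", "Xus yus", "ATGT"),
]
groups = group_by_family(records, SYNONYMS, FAMILY_OF)
```

Each family's FASTA text from `to_fasta(groups[family])` could then be passed to an aligner. Grouping by family before alignment keeps each alignment small and limits the sequence divergence it has to absorb.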
Geographic data: We first downloaded all geographic coordinates for class Mammalia from the Global Biodiversity Information Facility (GBIF) and used these to extract data from several GIS layers, including elevation, the 19 BIOCLIM layers at 1-km resolution pertaining to temperature and precipitation available from the WorldClim database, population density, gross domestic product, light pollution, protected areas, anthropogenic biomes, and GlobCover by the European Space Agency. We then curated these occurrence records using the R package CoordinateCleaner.
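The following Python sketch shows the kind of basic checks that occurrence-record curation typically involves. It is a simplified stand-in, not the CoordinateCleaner package and not the authors' code: the function name, the record layout, and the rounding precision are assumptions chosen for illustration, and the placeholder taxa are not real species.

```python
def clean_occurrences(records, precision=4):
    """Filter (species, latitude, longitude) tuples with simple sanity checks.

    Drops records with missing coordinates, coordinates outside the valid
    range, the (0, 0) point that often marks a missing value, and duplicate
    species/location pairs after rounding to `precision` decimal places.
    """
    seen = set()
    kept = []
    for species, lat, lon in records:
        if lat is None or lon is None:
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue
        if lat == 0 and lon == 0:
            continue
        key = (species, round(lat, precision), round(lon, precision))
        if key in seen:
            continue
        seen.add(key)
        kept.append((species, lat, lon))
    return kept


# Toy records covering each rejection rule plus two valid points.
raw = [
    ("Aus bus", 10.5, 20.25),
    ("Aus bus", 10.5, 20.25),   # exact duplicate
    ("Aus bus", 0.0, 0.0),      # zero-zero placeholder
    ("Aus bus", 95.0, 10.0),    # latitude out of range
    ("Xus yus", None, 5.0),     # missing latitude
    ("Xus yus", -33.9, 18.4),
]
cleaned = clean_occurrences(raw)
```

Dedicated tools apply further tests that this sketch omits, so it should be read only as an outline of the filtering idea.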
Created:
2022-03-23



