Avian point-counts from Rhode Island and Connecticut used to test species distribution models
Source: DataCite Commons. Updated: 2026-03-16. Indexed: 2026-04-25.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.8cz8w9gnp
Description:
Spatial biases are a common feature of presence-absence data from citizen
scientists. Spatial thinning can mitigate errors in species distribution
models (SDMs) that use these data. When detections or non-detections are
rare, however, SDMs may suffer from class imbalance or low sample size of
the minority (i.e. rarer) class. Poor predictions can result, the severity
of which may vary by modeling technique. To explore the consequences of
spatial bias and class imbalance in presence-absence data, we used eBird
citizen science data for 102 bird species from the northeastern USA to
compare spatial thinning, class balancing, and majority-only thinning
(i.e., retaining all samples of the minority class). We created SDMs using
two parametric or semi-parametric techniques (generalized linear models
and generalized additive models) and two machine-learning techniques
(random forest and boosted regression trees). We tested the predictive
abilities of these SDMs using an independent and systematically collected
reference dataset with a combination of discrimination (area under the
receiver operating characteristic curve; true skill statistic; area under
the precision-recall curve) and calibration (Brier score; Cohen’s kappa)
metrics. We found large variation in SDM performance depending on thinning
and balancing decisions. Across all species, there was no single best
approach, with the optimal choice of thinning and/or balancing depending
on modeling technique, performance metric, and the baseline sample
prevalence of species in the data. Spatially thinning all the data was
often a poor approach, especially for species with baseline sample
prevalence < 0.1. For most of these rare species, balancing classes
improved model discrimination between presence and absence classes, but
hindered model calibration. Baseline sample prevalence, sample
size, modeling approach, and the intended application of SDM output –
whether discrimination or calibration – should guide decisions about how
to thin or balance data, given the considerable influence of these
methodological choices on SDM performance. For prognostic
applications requiring good model calibration (vis-à-vis discrimination),
the match between sample prevalence and true species prevalence may be the
overriding feature and warrants further investigation.
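
The three data-preparation strategies compared in the description (spatial thinning of all data, class balancing, and majority-only thinning) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code: the record layout (`x`, `y`, `present`), the grid-cell thinning rule (one random record per cell), balancing by downsampling the majority class, and all function names are hypothetical. Two of the metrics named above, the true skill statistic and the Brier score, are included with their standard definitions.

```python
import random
from collections import defaultdict


def grid_thin(records, cell_size, rng):
    """Spatial thinning (assumed rule): keep one random record per square grid cell."""
    cells = defaultdict(list)
    for rec in records:
        key = (int(rec["x"] // cell_size), int(rec["y"] // cell_size))
        cells[key].append(rec)
    return [rng.choice(group) for group in cells.values()]


def split_classes(records):
    """Return (minority, majority) lists based on the 0/1 'present' field."""
    pres = [r for r in records if r["present"] == 1]
    absn = [r for r in records if r["present"] == 0]
    return (pres, absn) if len(pres) <= len(absn) else (absn, pres)


def majority_only_thin(records, cell_size, rng):
    """Retain every minority-class record; thin only the majority class."""
    minority, majority = split_classes(records)
    return minority + grid_thin(majority, cell_size, rng)


def balance_classes(records, rng):
    """Class balancing (assumed variant): downsample the majority to the minority size."""
    minority, majority = split_classes(records)
    return minority + rng.sample(majority, len(minority))


def prevalence(records):
    """Sample prevalence: fraction of records that are presences."""
    return sum(r["present"] for r in records) / len(records)


def true_skill_statistic(y_true, y_pred):
    """Discrimination metric: sensitivity + specificity - 1, from binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn) + tn / (tn + fp) - 1


def brier_score(y_true, p_hat):
    """Calibration-sensitive metric: mean squared error of predicted probabilities."""
    return sum((p - t) ** 2 for t, p in zip(y_true, p_hat)) / len(y_true)


# Synthetic example: 200 checklists in a 10 x 10 area, 10 of them detections.
rng = random.Random(0)
records = [
    {"x": rng.uniform(0, 10), "y": rng.uniform(0, 10), "present": int(i < 10)}
    for i in range(200)
]

thinned_all = grid_thin(records, cell_size=2.0, rng=rng)
thinned_majority = majority_only_thin(records, cell_size=2.0, rng=rng)
balanced = balance_classes(records, rng=rng)

for name, data in [("raw", records), ("thin all", thinned_all),
                   ("majority-only thin", thinned_majority), ("balanced", balanced)]:
    print(f"{name}: n={len(data)}, prevalence={prevalence(data):.3f}")
```

In this sketch, thinning all data can discard some of the few detections, majority-only thinning keeps all of them, and balancing raises the sample prevalence to 0.5. That shift in sample prevalence is the property the description links to calibration for rare species.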
Provider:
Dryad
Created:
2020-10-29



