Avian point-counts from Rhode Island and Connecticut used to test species distribution models

Mendeley Data · Updated 2024-05-10 · Indexed 2024-06-27
Download link:
https://zenodo.org/records/4245079
Resource description:
Spatial biases are a common feature of presence-absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non-detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e., rarer) class. Poor predictions can result, the severity of which may vary by modeling technique. To explore the consequences of spatial bias and class imbalance in presence-absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing, and majority-only thinning (i.e., retaining all samples of the minority class). We created SDMs using two parametric or semi-parametric techniques (generalized linear models and generalized additive models) and two machine-learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operating characteristic curve; true skill statistic; area under the precision-recall curve) and calibration (Brier score; Cohen's kappa) metrics. We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modeling technique, performance metric, and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence < 0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes, but hindered model calibration.
Baseline sample prevalence, sample size, modeling approach, and the intended application of SDM output – whether discrimination or calibration – should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis-à-vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.
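
As a rough illustration of two ideas named in the description, the sketch below shows majority-only thinning (keep every minority-class record, thin only the majority class) together with one calibration metric (Brier score) and one discrimination metric (true skill statistic). It is not the authors' code. The grid-cell thinning rule, the function names, the `(x, y, label)` record layout, and the 0.5 threshold are all assumptions made for this example; the dataset description does not specify how thinning was implemented.

```python
import random
from collections import defaultdict


def thin_majority_only(records, cell_size, seed=0):
    """Keep all minority-class records; keep one random majority-class
    record per grid cell (grid thinning is an assumption of this sketch).

    records: list of (x, y, label), label 1 = detection, 0 = non-detection.
    """
    rng = random.Random(seed)
    n_pos = sum(1 for _, _, label in records if label == 1)
    # The minority class is whichever label occurs less often.
    minority = 1 if n_pos <= len(records) - n_pos else 0
    kept = [r for r in records if r[2] == minority]
    cells = defaultdict(list)
    for r in records:
        if r[2] != minority:
            cell = (int(r[0] // cell_size), int(r[1] // cell_size))
            cells[cell].append(r)
    for members in cells.values():
        kept.append(rng.choice(members))
    return kept


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def true_skill_statistic(y_true, p_pred, threshold=0.5):
    """Sensitivity + specificity - 1; assumes both classes are present."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, p_pred):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn) + tn / (tn + fp) - 1
```

Thinning only the majority class preserves the few detections (or non-detections) available for a rare species, which is the motivation the description gives for comparing it against thinning all records.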

Created:
2023-06-28