Data from: "Cross-realm transferability of species distribution models – species characteristics matter more than modelling methods applied"

Source: Mendeley Data. Updated 2024-05-10; indexed 2024-06-30.
Download link:
https://zenodo.org/records/10952050
Description:
**Abstract** This dataset contains occurrence observations (presence-absence) of 11 aquatic macrophytes from Bothnian Sea and Lake Puruvesi, together with the environmental covariates used to build species distribution models (SDMs) in the paper "Cross-realm transferability of species distribution models – species characteristics matter more than modelling methods applied". At the moment the data and the code are supplied anonymously for double-blind peer review. In addition to the data files, R code for fitting the SDMs and for replicating the analysis conducted in the paper is supplied. The data is stored in rdata format (point data), without coordinate information due to data policy restrictions.

The species in the data are Isoëtes lacustris, Isoëtes echinospora, Ranunculus reptans, Ranunculus schmalhausenii, Potamogeton berchtoldii, Potamogeton perfoliatus, Potamogeton gramineus, Myriophyllum alterniflorum, Equisetum fluviatile, Eleocharis acicularis and Elodea canadensis.

The environmental covariates are bottom water salinity, turbidity, sandy substrate occurrence, colored dissolved organic matter (CDOM), surface fetch, sampling depth, total nitrogen, total phosphorus, and distance to the closest Phragmites australis reed bed.

**Objective of the study** The modelling objective of the paper was to assess species distribution model (SDM) transferability. Transferability was assessed by using models built in marine areas to project the distributions of the target species in Lake Puruvesi, Saimaa, Eastern Finland. Macrophyte mapping data from Lake Puruvesi was used as independent test data against which the transferability of the models was assessed.

**Location** The species data was collected from two geographic areas: Bothnian Bay (Baltic Sea) and Lake Puruvesi (Eastern Finland). The marine observations from Bothnian Bay were split into three overlapping areas (areas 1-3) to test the effect of input data gradient length on SDM transferability.
The largest marine sampling area (Area 3) spanned latitudes 62.95 to 65.91 and longitudes 19.14 to 27.99. Area 2 spanned latitudes 63.95 to 65.91 and longitudes 21.53 to 27.99. Area 1 spanned latitudes 64.91 to 65.91 and longitudes 23.82 to 27.99. The Hummonselkä subbasin of Lake Puruvesi, where the macrophyte test data was collected, lies between latitudes 61.89 and 62.05 and longitudes 29.58 and 29.78.

**Species data** The species observations were collected along diving transects placed on the floor of the sea or lake, and were recorded in 2 m² grid cells spaced every 10 meters along the transect or every 1 meter of depth, depending on which criterion was met first. The species data was collected from 2010 to 2020 in the marine area and in 2017 in Lake Puruvesi. All macrophytes in the 2 x 1 m frames were identified to species level by the diver, and the data contains the presence or absence of each species in each grid cell. The diving transects followed the systematic survey protocol used in the underwater inventories of the Finnish Underwater Biodiversity Survey Program (VELMU). The locations of the diving transects were not randomly distributed, but were placed using expert judgement. As all our study species are macroscopic and relatively easy to identify in the field (with the exception of possible confusion between I. echinospora and I. lacustris), we consider the absences in our observation data to indicate true absences. That said, as the observation area is rather small (2 m²), a species may occur at the site of investigation (e.g. a small lagoon) but fall outside the vegetation sampling grid.

**Environmental data**

**Bottom water salinity** The seasonal mean bottom water salinity was modeled using a generalized additive model with mean salinity as the response, a log link and a gamma distribution for the errors. This was necessary to keep the resulting predictions positive.
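As a small illustration of why the log link keeps salinity predictions positive, the sketch below applies the inverse link (the exponential) to a few linear-predictor values. The function name `predict_mean` and the example values are hypothetical and are not taken from the fitted model.

```python
import math


def predict_mean(linear_predictor: float) -> float:
    """Inverse of the log link: map the linear predictor to the response scale."""
    return math.exp(linear_predictor)


# Hypothetical linear-predictor values (intercept plus smooth terms).
# Even strongly negative values map to a positive predicted mean, whereas an
# identity link would return the negative value itself.
for eta in (-5.0, -0.5, 0.0, 1.2):
    print(f"eta = {eta:5.1f} -> predicted mean = {predict_mean(eta):.4f}")
```

With an identity link the same negative linear predictors would translate directly into negative salinity values, which is the situation the log link avoids.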
Bottom depth, CDOM, river influence and spatial location were used as predictors. Data from 448 locations were used, and each location had a minimum of three observations. The model was validated using 30 % of the data held out from model fitting. The explained deviance of the model was 0.94, and the correlation between raw data and predicted values was 0.95, with few outliers.

**Turbidity** Maps of turbidity (in FNU, Formazin Nephelometric Unit) were generated from Sentinel-2 Multi-Spectral Imager (MSI) observations using the Case-2 Regional Coast Colour (C2RCC) bio-optical inversion model, which contains separate atmospheric correction and water quality parts. Before running C2RCC, the original 10-meter input data was downsampled to 60 meters. The C2RCC output variable related to turbidity is the backscattering of total suspended sediments at 443 nm, which was further calibrated into turbidity (FNU) values using SYKE's empirical equations for coastal waters and clear lakes (for a similar approach, see Attila et al. (2013) and Sagerman, Hansen, and Wikström (2020)). Monthly observations of turbidity were aggregated into median composites to reduce the effects of cloud cover and other disturbances. Due to low solar elevation and ice cover in winter, the turbidity maps were generated only for the summer months (May to September). The current processing covers the years 2017 to 2021. An average raster layer was created from the monthly observations as input for SDM building.

**Probability of sandy substrate** A random forest model was used to classify sandy bottoms from Sentinel-2 MSI satellite images in shallow water areas. The identification of sandy substrate is based on its higher reflectance compared to other substrates. The model was trained and validated using diver-recorded field observations in the Baltic area, and diver-recorded and echo sounding observations in the freshwater area.
For full coverage, including areas beyond the shallow water, the satellite image classification was combined with a boosted regression tree modelling result in the Baltic and an echo-sounding-based product in the freshwater region. The resulting layers were probabilities of sandy substrate at 10-meter cell resolution.

**CDOM** We used different methods to estimate the CDOM levels in Bothnian Bay and Lake Puruvesi, based on biogeochemical model data and satellite images. For Lake Puruvesi, we applied the Finnish Environment Institute's (Syke) in-house CDOM algorithm to the Sentinel-2 MSI images processed by the C2RCC bio-optical processor (Brockmann et al. 2016). The observations at 10 m resolution were aggregated as monthly averages for each month of the summer season (May to October) from 2017 to 2021. For Bothnian Bay, we used Syke's in-house Sentinel-2 MSI CDOM layers (resolution: 60 m) aggregated as seasonal averages (1 Jul to 7 Sep). CDOM values are given as the absorption coefficient of CDOM at 400 nm [m⁻¹].

**Surface fetch** A surface fetch raster was produced for Lake Puruvesi and Bothnian Bay. The analysis required a feature layer of the shorelines of Puruvesi and the Baltic Sea. First, we created polylines running north to south across the whole area of interest, spaced 20 meters apart, which is also the resolution of the output raster. These lines were then cut each time they hit the shoreline, and the parts of the lines that overlapped land were removed. The length of the remaining lines was then calculated, and a point carrying the distance value was created every 20 meters. Each time a line was cut, for example when hitting an island, and started again from the other side of the island, the distance calculation restarted from 0. This created a point dataset with a distance value at each point.
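The single-direction step described above can be sketched as follows. This is a minimal illustration, not the authors' GIS workflow: it assumes each north-to-south line has already been sampled into a sequence of water/land flags at 20-meter spacing, and the function name `distance_along_line` and the example line are hypothetical.

```python
from typing import List, Optional

CELL_SIZE_M = 20.0  # point spacing along each line, as stated in the text


def distance_along_line(is_water: List[bool],
                        cell_size: float = CELL_SIZE_M) -> List[Optional[float]]:
    """Return the running open-water distance at each point of one line.

    Land points get None. After every land point the running distance
    restarts from 0, mirroring the reset described for islands.
    """
    distances: List[Optional[float]] = []
    running = 0.0
    for water in is_water:
        if water:
            distances.append(running)
            running += cell_size
        else:
            distances.append(None)
            running = 0.0
    return distances


# Hypothetical line: shore, three water points, an island, two water points.
line = [False, True, True, True, False, True, True]
print(distance_along_line(line))
```

Assigning 0 to the first water point after land is an assumption based on the statement that the calculation restarts from 0.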
We repeated the procedure 15 times for different compass directions at 22.5-degree intervals and calculated the average fetch for each point location on the 20-meter grid from these 15 point layers.

**Depth** Depth was measured by a diver using a dive computer while surveying each vegetation grid cell, and the measured depth was used when projecting model results to Puruvesi (transferability performance). In addition, a depth model for the freshwater region was created from a Sentinel-2 MSI satellite image using the logarithmic band ratio model of the blue and red bands. The model was calibrated using diver-recorded field observations and validated against echo sounding measurements. For more complete coverage, and to include deep areas, echo soundings from multiple sources were combined with the satellite-derived bathymetry. The cell resolution of the resulting depth layer was 10 meters.

**Total nitrogen and phosphorus** Mean total nitrogen and phosphorus layers for the marine area were produced using the ArcGIS "splines with barriers" tool for the EEZ of Finland at 20-meter spatial resolution (Virtanen et al. 2018). Summer (July to September) nutrient measurements from 0 to 10 meters depth between 2010 and 2020, obtained from the VESLA database, were used as input data for the interpolation. Nitrogen and phosphorus measurements in Puruvesi between 2010 and 2020 were gathered from the VESLA database. Data from July to September was selected to represent the growing season. A mean value of NTOT and PTOT was then calculated for each location. The Spline with Barriers (SwB) tool was used to interpolate the values (ArcMap 10.7.1). The tool uses a feature layer as a barrier to create a raster representing only the area of interest. For the barrier and the extent of the interpolated raster we used a shapefile representing the Lake Puruvesi shoreline. The resolution was set to 5 x 5 meters.
The SwB tool created an "extent box" around the area of interest, which was removed with the Extract by Mask tool using the shoreline feature layer. After the interpolation we noticed that either one of the locations was situated on land or the polygon used as the barrier was "leaking". SwB does not interpolate areas that have no locations with values or are not connected to the main body of water. To fix this, the raster was extended outwards based on the values of nearby cells, and the raster was then masked again to remove any cells on land. The phosphorus interpolation produced negative values in the southern parts of Enanlahti in Kontiolahti and Muholanlahti. These negative values were caused by considerably larger phosphorus values at the Enanlahti Lamminniemi (9 m) and Enanlahti Lamminniemi (4 m) locations compared with the nearby Puruvesi Enanlahti location. The interpolation apparently continued to decrease the values along the trend set by the difference between these locations until they became negative. The southern parts of the bay, about 750 meters, were removed, and new values were calculated from the surrounding cells with the Focal Statistics tool. The interpolations were validated by removing 20 % of the locations and repeating the interpolation. The removed locations and their values were then compared to the interpolated raster. The R² value of the phosphorus interpolation model was 0.91 after removing two outliers, and the R² value of the nitrogen interpolation model was 0.715 after removing one outlier.

**Distance to closest reed** The aquatic vegetation (Phragmites australis reeds) presence/absence maps were also generated from Sentinel-2 MSI data. The processing included extracting one month of data (July 2019) from the green and near-infrared bands of the Sentinel-2 Global Mosaic (S2GM) service and transforming those into normalized-difference vegetation indices (NDVIs).
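A normalized-difference index of two bands can be sketched as below. The band pairing follows the text (green and near-infrared); the conventional NDVI definition uses the red band instead of green. The function name, the reflectance values and the handling of a zero denominator are assumptions for illustration only.

```python
def normalized_difference(nir: float, other_band: float) -> float:
    """Normalized difference of two reflectances: (nir - other) / (nir + other).

    Returns 0.0 when both reflectances are zero; this fallback is an
    assumption of this sketch, not something stated in the dataset description.
    """
    total = nir + other_band
    if total == 0:
        return 0.0
    return (nir - other_band) / total


# Hypothetical reflectances: dense emergent vegetation reflects strongly in
# the near-infrared, open water does not.
print(normalized_difference(nir=0.40, other_band=0.10))  # vegetation-like pixel
print(normalized_difference(nir=0.02, other_band=0.06))  # water-like pixel
```

The index is bounded between -1 and 1 for non-negative reflectances, which makes it convenient as a predictor alongside distance from shore.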
After that, Bayesian statistics were used to predict the posterior probability of vegetation occurrence, with distance from shore and NDVI as predictor variables. The posterior variable was thresholded, and the resulting vegetation presence areas were sieved so that both vegetation areas that were too small (fewer than 5 pixels) and areas not directly attached to the shoreline were removed. The resulting map has a 10 m pixel size and tentatively represents the locations of reed belts or other shoreline-attached vegetation. This EO-based layer could also be referred to as helophytes or helophytic macrophytes, as it denotes a specific zone of emergent aquatic plants containing chlorophyll, particularly those that grow densely and have horizontally oriented leaves. In some lakes this layer can represent, for example, thick stands of Equisetum fluviatile, although in most cases it is associated with common reed belts. The approach is described in more detail in Koponen et al. (2022).

**Data partitioning** The data was partitioned with a 70/30 split into training and test (interpolation accuracy) data. The split was repeated 100 times for each species by randomly selecting 70 % of the observations, which were used to build each of the SDMs (GLM, GAM, BRT and BART). The partitioning was repeated for each of the three input data areas and 11 species. The input data indexes for replicating the split are supplied in the data files.

**R code** The code files contain scripts for fitting the SDMs described in the paper using the data. Code for the beta regression analyses of the modelling results conducted in the paper is also supplied.
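The repeated 70/30 partitioning described under Data partitioning can be sketched as follows. This is an illustrative Python version, not the authors' R code; the function name `repeated_split_indices`, the seed and the example size are hypothetical, and the actual indexes used in the paper are the ones supplied in the data files.

```python
import random
from typing import List, Tuple


def repeated_split_indices(n_obs: int,
                           n_repeats: int = 100,
                           train_fraction: float = 0.7,
                           seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Draw repeated random train/test index splits without replacement."""
    rng = random.Random(seed)  # fixed seed so the splits can be reproduced
    n_train = round(train_fraction * n_obs)
    splits = []
    for _ in range(n_repeats):
        train = sorted(rng.sample(range(n_obs), n_train))
        train_set = set(train)
        test = [i for i in range(n_obs) if i not in train_set]
        splits.append((train, test))
    return splits


# Hypothetical example: 200 observations of one species in one input area.
splits = repeated_split_indices(n_obs=200)
train, test = splits[0]
print(len(splits), len(train), len(test))
```

Storing the drawn indexes, as the dataset does, lets every model type (GLM, GAM, BRT and BART) be fitted and evaluated on exactly the same partitions.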

Created:
2024-04-13