Simulated Herbarium data for testing the accuracy with which specimen data can predict the timing and duration of population-level flowering displays

Mendeley Data, updated 2024-04-13, indexed 2024-06-30

Download link:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.dbrv15f79


Description:

Creating a reference dataset: generating sample locations representing known population-level phenological distributions and individual phenological parameters
We simulated phenological data for 1200 hypothetical “species” in the coterminous USA that varied in the attributes of their individual- and population-level flowering phenology. For each of these simulated species, we selected 1000 locations within the continental United States, each representing a local population observed during a single year from which a simulated specimen was later obtained. The coordinates for each location, year, and associated mean annual temperature in the year of collection were randomly selected without replacement from 4-km² PRISM pixels (PRISM Climate Group 2011) between the years 1901 and 2020, and were restricted to locations with 1991–2020 temperature normals of 1–20 °C and mean annual precipitation normals for the same period of 60–3800 mm. Each species generated this way was assigned a series of attributes defining its individual- and population-level flowering phenology.
The peak flowering date of an individual was assumed to coincide with its mean flowering date. We then defined a linear equation describing the relationship between the mean date of peak flowering among individuals within a population and local temperature conditions. Each species was assigned a median population flowering day of year (DOY) of 50 at 0 °C (i.e., the intercept) as well as a phenological responsiveness (i.e., slope) of median flowering DOY to mean annual temperature: advancing by 1, 4, or 8 days per 1 °C increase. Next, we assigned each species a low or high magnitude of intrapopulation variation in phenological timing (i.e., in peak flowering DOYs) among individuals (based on normal distributions with standard deviations (σ) of either 10 or 30 days), representing the magnitude of variation in the flowering times of early- to late-flowering individuals within each local population.
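
The linear relationship and the normal distribution of individual peak dates described above can be illustrated with a minimal sketch. This is not the authors' code: the function names and the use of NumPy are assumptions made for illustration, and the sketch applies no wrapping or clipping to the resulting DOY values.

```python
import numpy as np

INTERCEPT_DOY = 50.0  # median population flowering DOY at 0 °C, as stated in the text


def population_median_doy(mean_annual_temp_c, days_advance_per_degc):
    """Median peak-flowering DOY of a population.

    Flowering advances (DOY decreases) by `days_advance_per_degc`
    for each 1 °C increase in mean annual temperature.
    """
    temp = np.asarray(mean_annual_temp_c, dtype=float)
    return INTERCEPT_DOY - days_advance_per_degc * temp


def individual_peak_doys(median_doy, sigma_days, n_individuals, rng):
    """Draw peak-flowering DOYs of individuals around the population median."""
    return rng.normal(loc=median_doy, scale=sigma_days, size=n_individuals)


# Hypothetical example: a site at 5 °C, responsiveness of 4 days per °C,
# low intrapopulation variation (sigma = 10 days).
rng = np.random.default_rng(0)
median = population_median_doy(5.0, 4.0)
peaks = individual_peak_doys(median, 10.0, 5, rng)
```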
Then, each species was assigned a short, moderate, or long duration of the flowering period for each individual within each population (15, 30, or 60 days, representing the length of time each individual plant was in flower). Fifty species were simulated for each of these 18 combinations of phenological responsiveness, flowering duration, and intrapopulation variation in phenological timing.
To accommodate the possibility that the magnitude of variation in phenological timing within a population could depend on local climate conditions, we also simulated 50 species with temperature-sensitive intrapopulation phenological variation (σ) ranging from 10 to 30 days. For these species, σ of the DOY among individuals in a given population increased by 1 day for every 1 °C increase in the mean annual temperature of its location. For these simulated species, individual flowering duration was fixed at 30 days. Additionally, to accommodate the possibility that individual flowering durations could exhibit linear relationships with local climate conditions, we also simulated 50 species that exhibited individual-level variation in flowering duration resulting from changes in temperature (increasing by 1 day per 1 °C increase in mean annual temperature, and ranging from 10 days to 30 days). For these species, the degree of intrapopulation variation in peak flowering dates was held constant at σ = 30 days (i.e., high intrapopulation variation).
Calculation of population-level onset, median, and termination dates of flowering
For each population of each species described above, we calculated a distribution of individual-level peak flowering dates, assumed to be normally distributed (Clark and Thompson 2011), based on the flowering attributes of the species and the temperature conditions corresponding to its site and year of observation.
First, we calculated the median flowering DOY at the location and year from which each specimen was collected based on its pre-defined intercept and phenological responsiveness to mean annual temperature (i.e., 1, 4, or 8 days per °C). Then, we obtained the standard deviation of each local population (i.e., its degree of intrapopulation variation in flowering dates) based on the flowering attributes of the simulated species as outlined above. Next, we arbitrarily defined population-level flowering onset DOYs for each population and year as the 10th percentile of a normally distributed population whose mean and standard deviation we obtained in the previous steps (i.e., the DOYs by which the first 10% of individuals in a local population at a given location and year would have reached their median flowering dates). Similarly, the population-level flowering termination dates were calculated as the 90th percentile of a normally distributed population with the same characteristics as described above (i.e., the DOYs by which all but 10% of individuals in a local population at a given location and year would have reached their peak (or mean) flowering dates).
Through this process, we obtained a sample of 1000 annual population-level distributions of flowering dates for each of 1200 hypothetical species. For each of these populations, the quantiles of their flowering distribution (representing the nth individual reaching peak flowering within a population) were known a priori, representing a benchmark against which to compare estimates derived from simulated specimen data.
Simulating randomly selected (unbiased) phenological snapshots from pre-defined populations
For each species, we then generated simulated specimens by: (1) randomly selecting an individual within each population and (2) selecting a random DOY within its individual-level flowering period that emulated the phenological snapshot provided by real herbarium specimens.
Specifically, using the distribution of peak flowering dates of each population, we selected an individual at random. From its peak flowering date, we then obtained onset and termination dates by subtracting (for flowering onset) or adding (for flowering termination) half the individual’s flowering duration for that species to the sampled date of peak flowering. To simulate a phenological snapshot for that individual, we then randomly selected a DOY between the onset and termination of that individual’s flowering period. As a result, each simulated datum represented a simulated herbarium specimen that accounted for uncertainty both in the timing of the individual relative to its source population and in the timing of the collection relative to the onset and termination of that individual’s flowering period. This procedure was repeated across all locations for each simulated species, generating 1000 data points (i.e., simulated specimens or phenological snapshots) per species.
Simulating biases in collection effort across population-level flowering periods
To simulate biases towards collection of specimens during the early or late portion of their local population-level flowering displays, we selected an individual at random within each population and year using both left- and right-skewed normal probability distributions. These distributions were constructed by modulating the parameter α of scipy.stats.skewnorm in the Python package SciPy v1.10.1 (Azzalini and Capitanio 1998), such that while the underlying plant population was treated as exhibiting a normal distribution (α = 0), samples were collected from that population with a left-skewed (α = -1.0) or right-skewed (α = 1.0) probability distribution. Once an individual was selected from these skewed distributions, the timing of sample collection within the individual flowering duration of each of these ‘specimens’ was generated using the same methods as for unbiased specimens.
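
The two-step specimen draw and its population-level biased variants can be sketched as follows. This is an illustrative sketch rather than the authors' script: the function name `simulate_specimen_doys` is hypothetical, and passing the population median and σ unchanged as `loc` and `scale` to scipy.stats.skewnorm is an assumption, because the description does not state how the skewed distributions were parameterized beyond α.

```python
import numpy as np
from scipy.stats import skewnorm


def simulate_specimen_doys(pop_median_doy, sigma_days, duration_days,
                           alpha=0.0, seed=None):
    """Simulate one specimen collection DOY per population.

    alpha = 0 gives an unbiased (normal) draw of the individual;
    alpha < 0 is left-skewed and alpha > 0 is right-skewed.
    """
    rng = np.random.default_rng(seed)
    pop_median_doy = np.asarray(pop_median_doy, dtype=float)

    # Step 1: pick one individual per population via its peak flowering DOY.
    # ASSUMPTION: loc/scale are the population median and sigma for every alpha.
    peak = skewnorm.rvs(a=alpha, loc=pop_median_doy, scale=sigma_days,
                        size=pop_median_doy.shape, random_state=rng)

    # Step 2: the individual's flowering period is peak +/- half its duration;
    # the collection date is drawn uniformly within that period.
    onset = peak - duration_days / 2.0
    termination = peak + duration_days / 2.0
    collection = rng.uniform(onset, termination)
    return peak, collection


# Hypothetical example: 1000 populations sharing a median DOY of 30,
# sigma = 30 days, individual flowering duration = 30 days.
medians = np.full(1000, 30.0)
peak_unbiased, doy_unbiased = simulate_specimen_doys(medians, 30.0, 30.0, alpha=0.0, seed=1)
peak_early, doy_early = simulate_specimen_doys(medians, 30.0, 30.0, alpha=-1.0, seed=1)
peak_late, doy_late = simulate_specimen_doys(medians, 30.0, 30.0, alpha=1.0, seed=1)
```

Under this parameterization a negative α shifts sampled individuals toward earlier peak dates and a positive α toward later ones, which is the direction of bias the text describes.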
We then determined the accuracy of the model predictions generated from datasets exhibiting biased and unbiased sampling of local populations by comparing predicted population-level flowering onset and termination dates with the actual (i.e., known, simulated) flowering dates that were produced using a normal distribution. To minimize computation time, population-level biases were examined only for the subset of species for which phenological responsiveness to mean annual temperature equaled 4 days/°C (representing moderate responsiveness to climate stimuli), intrapopulation variation was high (σ = 30), and individual flowering duration was moderate (30 days).
Simulating biases in the timing of collection within flowering periods of individuals
In addition to biases towards collection of early or late individuals within a local population, botanists may also preferentially collect individuals from the early or late portion of their individual flowering period (i.e., individual collection bias). In some cases, collectors may preferentially collect individuals that are close to their peak flowering date because this is when the most flowers are displayed. In other cases, collectors may preferentially collect specimens that have only recently begun to flower, when floral structures may exhibit less damage from inclement weather or herbivores, or specimens close to flowering termination when the collector prefers specimens that include both flowers and fruits. Accordingly, for each population of each species, we simulated DOYs within each individual’s flowering period both at random (i.e., without bias) and with three different types of bias. Unbiased collections were simulated by selecting a date uniformly at random within the flowering period of each sampled individual.
To represent a bias toward collection of individuals close to their peak (median) flowering DOY, we sampled collection dates from a truncated normal distribution centered on an individual’s mean flowering date and with σ = 25% of the flowering duration for that species and location (henceforth referred to as mean-biased collection data). To represent a bias toward collection dates shortly after flowering onset (henceforth, onset-biased collection data), we sampled collection dates from a truncated normal distribution centered on a date 25% earlier than the mean flowering onset date of that individual (σ = 25% of the flowering duration). Finally, to represent a bias toward collection on dates shortly before flowering termination (henceforth, termination-biased collection data), we sampled collection dates from a truncated normal distribution centered on a date 25% later than the mean flowering onset date of that individual (σ = 25% of the flowering duration). As with examinations of population-level bias, collection biases within the flowering periods of individuals were examined only for the subset of species for which phenological responsiveness to mean annual temperature equaled 4 days/°C, intrapopulation variation was high (σ = 30), individual flowering duration was moderate (30 days), and no population-level bias was present.
Estimating population-level flowering onsets and terminations from simulated herbarium data
We generated phenoclimate models for each species from each set of simulated specimen collection dates using quantile regression (Koenker et al. 2018) in RStudio (R Team 2020). In all cases, each model regressed observed DOYs of the phenological snapshots of all sampled individuals of a given species against mean annual temperature.
From these 1450 models (the species-specific models for all 1200 species, plus 150 models exhibiting population-level collection biases and 100 models exhibiting individual-level collection biases), we predicted the 10th, 50th, and 90th percentiles of flowering DOYs for each species from the mean annual temperatures corresponding to the years and locations of their source populations. We then calculated the mean absolute error (MAE) of the linear regression of the known timing of the onset (or termination) of the peak flowering period for each reference population on the predicted DOYs produced by each phenoclimate model based on the simulated herbarium data. For each metric of population-level phenology (i.e., flowering onset, peak (median DOY), and termination), we then used Tukey HSD tests to compare the mean accuracies (estimated as MAE) of these predicted DOYs, relative to the actual population-level metrics, among models constructed from species that differed in their phenological sensitivities to climate, flowering durations, degrees of intrapopulation variation in phenological timing, and collection biases. Similarly, within each group of species that shared the same flowering duration, phenological responsiveness, and intrapopulation phenological variation, we tested whether the mean MAE of estimated peak flowering onset and termination dates differed significantly from the mean MAE of estimated median flowering dates. We also used Tukey HSD tests to compare the accuracy of estimated onset, median, and termination dates of the peak flowering period among all species produced from each of the simulated datasets.
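
To make the benchmark comparison concrete, the sketch below computes the known population-level onset and termination DOYs as the 10th and 90th percentiles of a normal distribution and scores a set of predictions with a plain mean absolute error. It is a simplification under stated assumptions: the function names are hypothetical, the "predicted" values are made up for illustration, and the direct known-versus-predicted MAE shown here may differ in detail from the regression-based MAE the authors describe.

```python
import numpy as np
from scipy.stats import norm


def known_population_dates(pop_median_doy, sigma_days):
    """Known onset (10th percentile) and termination (90th percentile) DOYs."""
    pop_median_doy = np.asarray(pop_median_doy, dtype=float)
    onset = norm.ppf(0.10, loc=pop_median_doy, scale=sigma_days)
    termination = norm.ppf(0.90, loc=pop_median_doy, scale=sigma_days)
    return onset, termination


def mean_absolute_error(known, predicted):
    """Plain MAE between known and predicted DOYs (ASSUMED simplification)."""
    known = np.asarray(known, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(known - predicted)))


# Hypothetical example: three populations with sigma = 10 days.
medians = np.array([30.0, 20.0, 10.0])
onset, termination = known_population_dates(medians, 10.0)

# Made-up model predictions of onset, used only to demonstrate the metric.
predicted_onset = onset + np.array([2.0, -1.0, 3.0])
onset_mae = mean_absolute_error(onset, predicted_onset)
```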
Finally, we re-fit all 1200 models (including all 24 combinations of species parameters but excluding models constructed to test the effects of collection biases) with randomly selected subsets of data (100–1000 specimens per species) to determine how sample size affected model performance and predictive accuracy. To evaluate whether more data would be needed when variation in phenology among populations is not perfectly explained by the climate variables included in the model, we ran additional simulations in which population-level mean DOYs (and associated onset and termination DOYs of the flowering period) of each species at each sampled location and year included random variation not associated with local climate: adding either ±5 days (i.e., a low-noise scenario) or ±15 days (i.e., a high-noise scenario) to the onset, median, and termination DOYs of flowering. For each location and year, the random offsets applied to the onset, median, and termination DOYs were identical, such that random variation was incorporated only into the timing of flowering, and not its duration.
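
The shared-offset noise scheme and the random subsampling can be sketched as below. This is an illustration only: the description does not say how the ±5 or ±15 day offsets were distributed, so drawing them uniformly within that range is an assumption, and both function names are hypothetical.

```python
import numpy as np


def add_timing_noise(onset, median, termination, max_offset_days, rng):
    """Shift onset, median, and termination by the same random offset.

    ASSUMPTION: offsets are uniform on [-max_offset_days, +max_offset_days].
    Using one offset per population changes timing but not duration.
    """
    onset = np.asarray(onset, dtype=float)
    median = np.asarray(median, dtype=float)
    termination = np.asarray(termination, dtype=float)
    offset = rng.uniform(-max_offset_days, max_offset_days, size=median.shape)
    return onset + offset, median + offset, termination + offset


def subsample_indices(n_total, n_keep, rng):
    """Indices of a random subset of specimens, drawn without replacement."""
    return rng.choice(n_total, size=n_keep, replace=False)


# Hypothetical example: low-noise scenario (±5 days) for two populations,
# followed by a 100-specimen subset out of 1000.
rng = np.random.default_rng(7)
onset = np.array([10.0, 20.0])
median = np.array([25.0, 40.0])
termination = np.array([40.0, 60.0])
noisy_onset, noisy_median, noisy_termination = add_timing_noise(
    onset, median, termination, 5.0, rng)
subset = subsample_indices(1000, 100, rng)
```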


Created:

2024-03-28