R code for analysis of Irukandji data of the GBR (NESP TWQ 2.2.3, CSIRO)
收藏Research Data Australia2024-12-14 收录
下载链接:
https://researchdata.edu.au/r-code-analysis-223-csiro/2974540
下载链接
链接失效反馈官方服务:
资源简介:
This dataset presents the code written for the analysis and modelling for the Jellyfish Forecasting System for NESP TWQ Project 2.2.3. The Jellyfish Forecasting System (JFS) searches for robust statistical relationships between historical sting events (and observations) and local environmental conditions. These relationships are tested using data to quantify the underlying uncertainties. They then form the basis for forecasting risk levels associated with current environmental conditions.
The development of the JFS modelling and analysis is supported by the Venomous Jellyfish Database (sting events and specimen samples – November 2018) (NESP 2.2.3, CSIRO) with corresponding analysis of wind fields and tidal heights along the Queensland coastline. The code has been calibrated and tested for the study focus regions including Cairns (Beach, Island, Reef), Townsville (Beach, Island+Reef) and Whitsundays (Beach, Island+Reef).
The JFS uses the European Centre for Medium-Range Weather forecasting (ECMWF) wind fields from the ERA Interim, Daily product (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim). This daily product has global coverage at a spatial resolution of approximately 80km. However, only 11 locations off the Queensland coast were extracted covering the period 1-Jan-1985 to 31-Dec-2016. For the modelling, the data has been transformed into CSV files containing date, eastward wind (m/s) and northward wind (m/s), for each of the 11 geographical locations.
Hourly tidal height was calculated from tidal harmonics supplied by the Bureau of Meteorology (http://www.bom.gov.au/oceanography/projects/ntc/ntc.shtml) using the XTide software (http://www.flaterco.com/xtide/). Hourly tidal heights have been calculated for 7 sites along the Queensland coast (Albany Island, Cairns, Cardwell, Cooktown, Fife, Grenville, Townsville) for the period 1-Jan-1985 to 31-Dec-2017. Data has been transformed into CSV files, one for each of the 7 sites. Columns correspond to number of days since 1-Jan 1990 and tidal height (m).
Irukandji stings were then modelled using a generalised linear model (GLM). A GLM generalises ordinary linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value (McCullagh & Nelder 1989). For each region, we used a GLM with the number of Irukandji stings per day as the response variable. The GLM had a Poisson error structure and a log link function (Crawley 2005). For the Poisson GLMs, we inferred absences when stings were not recorded in the data for a day. We consider that there was reasonably consistent sampling effort in the database since 1985, but very patchy prior to this date. It should be noted that Irukandji are very patchy in time; for example, there was a single sting record in 2017 despite considerable effort trying to find stings in that year. Although the database might miss small and localised Irukandji sting events, we believe it captures larger infestation events.
We included six predictors in the models: Month, two wind variables, and three tidal variables. Month was a factor and arranged so that the summer was in the middle of the year (i.e., from June to May). The two wind variables were Speed and Direction. For each day within each region (Cairns, Townsville or Whitsundays), hourly wind-speed and direction was used. We derived cumulative wind Speed and Direction, working backwards from each day, with the current day being Day 1. We calculated cumulative winds from the current day (Day 1) to 14 days previously for every day in every Region and Area. To provide greater weighting for winds on more recent days, we used an inverse weighting for each day, where the weighting was given by 1/i for each day i. Thus, the Cumulative Speed for n days is given by:
Cumulative Speed_n=(\sum_(i=1)^n Speed_i/i) / (\sum_(i=1)^n 1/i)
For example, calculations for the 3-day cumulative wind speed are:
(1/1×Wind Day 1 + 1/2 × Wind Day 2 + 1/3 × Wind Day 3) / (1/1+1/2+1/3)
Similarly, we calculated the cumulative weighted wind Direction using the formula:
Cumulative Direction_n=(\sum_(i=1)^n Direction_i/i) / (\sum_(i=1)^n 1/i)
We used circular statistics in the R Package Circular to calculate the weighted cumulative mean, because direction 0º is the same as 360º. We initially used a smoother for this term in the model, but because of its non-linearity and the lack of winds of all directions, we found that it was better to use wind Direction as a factor with four levels (NW, NE, SE and SW). In some Regions and Areas, not all wind Directions were present.
To assign each event to the tidal cycle, we used tidal data from the closest of the seven stations to calculate three tidal variables: (i) the tidal range each day (m); (ii) the tidal height (m); and (iii) whether the tide was incoming or outgoing. To estimate the three tidal variables, the time of day of the event was required. However, the Time of Day was only available for 780 observations, and the 291 missing observations were estimated assuming a random Time of Day, which will not influence the relationship but will keep these rows in the analysis. Tidal range was not significant in any models and will not be considered further.
To focus on times when Irukandji were present, months when stings never occurred in an area/region were excluded from the analysis – this is generally the winter months. For model selection, we used Akaike Information Criterion (AIC), which is an estimate of the relative quality of models given the data, to choose the most parsimonious model. We thus do not talk about significant predictors, but important ones, consistent with information theoretic approaches.
Limitations:
It is important to note that while the presence of Irukandji is more likely on high risk days, the forecasting system should not be interpreted as predicting the presence of Irukandji or that stings will occur.
Format:
It is a text file with a .r extension, the default code format in R. This code runs on the csv datafile “VJD_records_EXTRACT_20180802_QLD.csv” that has latitude, longitude, date, and time of day for each Irukandji sting on the GBR. A subset of these data have been made publicly available through eAtlas, but not all data could be made publicly available because of permission issues. For more information about data permissions, please contact Dr Lisa Gershwin (lisa.gershwin@stingeradvisor.com).
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\custodian\2016-18-NESP-TWQ-2\2.2.3_Jellyfish-early-warning\data\ and https://github.com/eatlas/NESP_2.2.3_Jellyfish-early-warning
本数据集收录了针对NESP TWQ项目2.2.3的水母预报系统(Jellyfish Forecasting System, JFS)开展分析与建模所用的代码。水母预报系统(JFS)旨在挖掘历史蜇伤事件(及观测数据)与当地环境条件之间的稳健统计关联关系,通过数据检验这些关联以量化其内在不确定性,并以此为基础针对当前环境条件预测相关风险等级。
JFS建模与分析工作的开展依托于有毒水母数据库(蜇伤事件与标本样本——2018年11月)(NESP 2.2.3、CSIRO),并结合昆士兰州沿岸的风场与潮位数据开展对应分析。本代码已针对研究重点区域完成校准与测试,研究区域包括凯恩斯(海滩、岛屿、礁体)、汤斯维尔(海滩、岛屿+礁体)以及圣灵群岛(海滩、岛屿+礁体)。
JFS使用欧洲中期天气预报中心(European Centre for Medium-Range Weather Forecasting, ECMWF)的ERA Interim每日再分析产品风场数据(https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim)。该每日产品空间分辨率约为80km,覆盖全球范围,但仅提取了昆士兰州沿岸11个站点的数据,时间跨度为1985年1月1日至2016年12月31日。建模过程中,数据已转换为CSV格式文件,包含11个地理站点各自的日期、东向风速(m/s)与北向风速(m/s)字段。
逐小时潮高由澳大利亚气象局(Bureau of Meteorology, BoM)提供的潮汐调和常数(http://www.bom.gov.au/oceanography/projects/ntc/ntc.shtml)结合XTide软件(http://www.flaterco.com/xtide/)计算得到。我们已针对昆士兰州沿岸7个站点(奥尔巴尼岛、凯恩斯、卡德韦尔、库克敦、法夫、格伦维尔、汤斯维尔)计算了1985年1月1日至2017年12月31日期间的逐小时潮高,并将数据转换为CSV格式文件,每个站点对应一个文件。文件列字段分别为自1990年1月1日起的天数与潮高(m)。
随后采用广义线性模型(generalised linear model, GLM)对伊鲁坎吉水母蜇伤事件进行建模。广义线性模型对普通线性回归进行了推广,允许通过连接函数将线性模型与响应变量相关联,同时允许每次测量的方差幅度为其预测值的函数(McCullagh & Nelder 1989)。针对每个研究区域,我们以每日伊鲁坎吉水母蜇伤数作为响应变量,构建了带有泊松误差结构与对数连接函数的广义线性模型(Crawley 2005)。针对泊松广义线性模型,我们将当日无蜇伤记录的情况视为零发生。我们认为,1985年以来数据库的采样强度相对一致,但1985年之前的采样非常零散。需要说明的是,伊鲁坎吉水母的分布在时间上极为零散:例如2017年尽管开展了大量搜寻工作,仅记录到1起蜇伤事件。尽管该数据库可能遗漏了小型局域性的伊鲁坎吉水母蜇伤事件,但我们认为其可捕捉到大规模的水母暴发事件。
我们在模型中纳入了6个预测变量:月份、2个风变量以及3个潮汐变量。月份作为分类因子,将夏季设置在年中(即6月至次年5月)。2个风变量分别为风速与风向。针对每个区域(凯恩斯、汤斯维尔或圣灵群岛)的每日数据,我们使用了逐小时的风速与风向数据,并以当日为第1天,向前回溯计算累积风速与累积风向。我们针对每个区域与子区域的每日数据,计算了自当日(第1天)至此前14天的累积风场数据。为了对近期风速赋予更高权重,我们采用了逆加权方式,即第i天的权重为1/i。因此,n天累积风速的计算公式为:
Cumulative Speed_n=(∑_(i=1)^n Speed_i/i) / (∑_(i=1)^n 1/i)
例如,3天累积风速的计算方式为:
(1/1×第1天风速 + 1/2 × 第2天风速 + 1/3 × 第3天风速) / (1/1+1/2+1/3)
同理,我们采用以下公式计算累积加权风向:
Cumulative Direction_n=(∑_(i=1)^n Direction_i/i) / (∑_(i=1)^n 1/i)
由于风向0°与360°为同一方向,我们使用R语言Circular包中的圆形统计方法计算加权累积均值。最初我们在模型中对风向项使用了平滑处理,但由于其非线性特征且缺乏全方向的风场数据,我们最终将风向划分为4个分类因子水平(西北NW、东北NE、东南SE、西南SW)。部分区域与子区域可能未涵盖所有风向类别。
为了将每个事件匹配至潮汐周期,我们使用距离最近的7个潮汐站点的数据计算3个潮汐变量:(i) 当日潮差(m);(ii) 当日潮高(m);(iii) 潮汐处于涨潮还是落潮状态。为估算这3个潮汐变量,需要获取事件发生的当日时刻,但仅780条观测记录带有当日时刻信息,剩余291条缺失的当日时刻数据通过随机赋值的方式进行估算——该操作不会影响变量间的关联关系,但可保留分析中的全部样本。潮差在所有模型中均未表现出显著性,因此不再进一步考虑。
为聚焦伊鲁坎吉水母出现的时段,我们将未发生过蜇伤事件的月份从分析中剔除——这类月份通常为冬季。模型选择阶段,我们采用赤池信息准则(Akaike Information Criterion, AIC)——一种基于数据评估模型相对质量的指标——来选择最简洁的最优模型。因此,我们不会讨论“显著”的预测变量,而是采用信息论方法聚焦于重要的预测变量。
### 局限性
需要特别说明的是,尽管伊鲁坎吉水母出现的概率在高风险日更高,但本预报系统不应被解读为直接预测伊鲁坎吉水母的出现或蜇伤事件的发生。
### 数据格式
本数据集为R语言默认代码格式的.r扩展名文本文件,运行代码所需的CSV数据文件为"VJD_records_EXTRACT_20180802_QLD.csv",该文件包含大堡礁区域每一起伊鲁坎吉水母蜇伤事件的纬度、经度、日期与当日时刻。其中部分数据已通过eAtlas平台公开,但由于权限问题,并非所有数据均可公开。如需了解数据权限相关信息,请联系Lisa Gershwin博士(邮箱:lisa.gershwin@stingeradvisor.com)。
### 数据存储位置
本数据集存储于eAtlas长期数据仓库中,路径为:datacustodian2016-18-NESP-TWQ-22.2.3_Jellyfish-early-warningdata 以及 https://github.com/eatlas/NESP_2.2.3_Jellyfish-early-warning
提供机构:
Australian Ocean Data Network



