Data and script pipeline for: Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods

Source: Mendeley Data. Updated 2024-06-03. Indexed 2024-06-28.

Download link:

https://zenodo.org/records/11371208

Resource description:

This README explains how to use the scripts and data provided in this repository to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. It explains how to (1) use the software with a small, simulated dataset; and (2) reproduce the analyses presented in the paper.

System requirements

· The software can be used on any operating system on which R can be installed.
· We have developed and tested the software in a Windows environment with R version 4.3.1.
· The simulated case study requires the R packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.0-14), pROC (1.18.5) and MCMCpack (1.7-0).
· Replicating the analyses reported in the paper requires the R packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.0-14), pROC (1.18.5), MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).
· Using the software does not require any non-standard hardware.

Installation guide

· The software is provided as an R pipeline, so it requires no installation other than that of R.

Software demo

The software demonstration consists of two R Markdown files:

· D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. Species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed for the analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset, as it is assumed to remain unknown to the user), Y (species occurrence data), phy (phylogenetic tree) and studyDesign (list of sampling units).
  Additionally, the true values used for data generation are saved in the file trueValuesDemo.RData: LF (the second environmental predictor, which will be estimated through a latent factor approach) and beta (species responses to the environmental predictors).
· D02_software_demo_apply_CORAL. This script loads the data generated by script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models compared with baseline models, and the ability of CORAL to estimate the true values used for data generation.

Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from a few seconds to at most one minute.

Scripts and data for reproducing the results presented in the paper

The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of the previous files, so they must be run in order.

The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:

· S01_define_Hmsc_model – defines the initial HMSC model with fixed effects and sample- and site-level random effects.
· S02_export_Hmsc_model – prepares the initial model for HPC sampling with Hmsc-HPC. The model can then be fitted in an HPC environment with the bash file generated by the script. Computationally intensive.
· S03_import_posterior – imports the posterior distributions sampled by the initial model.
· S04_define_second_stage_Hmsc_model – extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.
· S05_visualize_backbone_model – checks backbone model quality with visual and numerical summaries. Generates Fig. 2 of the paper.
· S06_construct_coral_priors – calculates the CORAL prior parameters.

The remaining scripts evaluate the model:

· S07_evaluate_prior_predictionss – uses the CORAL prior to predict rare species presences/absences and evaluates the predictions in terms of AUC. Generates Fig. 3 of the paper.
· S08_make_training_test_split – generates train/test splits for cross-validation, ensuring that at least 40% of the positive samples are in each partition.
· S09_cross-validate – fits CORAL and the baseline model to the train/test splits and calculates performance summaries. Note: we ran this once with the initial train/test split and then again on the inverse split (i.e., training = !training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.
· S10_show_cross-validation_results – makes plots visualizing the AUC and Tjur’s R2 values produced by cross-validation. Generates Fig. 4 of the paper.
· S11a_fit_coral_models – fits the CORAL model to all 250k rare species. Computationally intensive.
· S11b_fit_baseline_models – fits the baseline model to all 250k rare species. Computationally intensive.
· S12_compare_posterior_inference – compares posterior climate predictions from the CORAL and baseline models for selected species, as well as the variance reduction for all species. Generates Fig. 5 of the paper.

Pre-processing scripts:

· P01_preprocess_sequence_data.R – reads in the outputs of the bioinformatics pipeline and converts them into R objects.
· P02_download_climatic_data.R – downloads the climatic data from “sis-biodiversity-era5-global” and adds it to the metadata.
· P03_construct_Y_matrix.R – converts the response matrix from a sparse data format to a regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).

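
The split described for S08 and the inversion noted for S09 can be illustrated with a minimal base-R sketch. This is not the authors' code: the function name make_split, the per-species treatment and the retry strategy are assumptions made for illustration. Only the 40% threshold and the training = !training inversion come from the script descriptions.

```r
# Hypothetical helper (not taken from the pipeline): draw a random half/half
# split of the sampling units for one species and retry until each partition
# holds at least `min_frac` of that species' presences.
make_split <- function(y, min_frac = 0.4, max_tries = 1000) {
  stopifnot(all(y %in% c(0, 1)), sum(y) >= 2)
  n <- length(y)
  for (i in seq_len(max_tries)) {
    # TRUE = training unit, FALSE = test unit
    training <- sample(rep(c(TRUE, FALSE), length.out = n))
    frac_train <- sum(y[training]) / sum(y)
    frac_test <- sum(y[!training]) / sum(y)
    if (frac_train >= min_frac && frac_test >= min_frac) {
      return(training)
    }
  }
  stop("no split satisfied the constraint within max_tries")
}

set.seed(1)
y <- rep(c(1, 0), times = c(20, 180))  # toy species: 20 presences in 200 units
training <- make_split(y)

# First run: fit on y[training], evaluate on y[!training].
# Second run: swap the roles of the two partitions.
training_inverse <- !training
```

With very few presences the constraint may be impossible to meet (three presences can only be divided 1/2 or 0/3), in which case this sketch stops with an error rather than returning an invalid split.
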
Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.
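
The two performance measures named for S07 and S10, AUC and Tjur’s R2, can be computed for a single species as in the following base-R sketch. The pipeline lists pROC among its required packages, which presumably handles AUC there. The rank-based formula and the helper names auc and tjur_r2 below are stand-ins chosen to keep the example free of dependencies, not the authors' implementation.

```r
# Hypothetical helpers for illustration; y is 0/1 and p holds predicted
# occurrence probabilities for the same sampling units.

# AUC via the Mann-Whitney rank-sum identity (ties get average ranks).
auc <- function(y, p) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tjur's R2: mean predicted probability at presences minus that at absences.
tjur_r2 <- function(y, p) {
  mean(p[y == 1]) - mean(p[y == 0])
}

y <- c(0, 0, 1, 1)
p <- c(0.1, 0.4, 0.35, 0.8)
auc(y, p)      # 0.75
tjur_r2(y, p)  # 0.325
```

Under the two-split scheme described for S09, each measure would be computed once per split and the two values averaged.
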


Created:

2024-05-30