Data and script pipeline for: Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods
收藏NIAID Data Ecosystem2026-05-02 收录
下载链接:
https://zenodo.org/record/11076832
下载链接
链接失效反馈官方服务:
资源简介:
The scripts and the data provided in this depository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize how to use the software with a small, simulated dataset, with running time less than a minute in a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper for a small subset of the data, with running time of ca. one hour in a powerful laptop (Demo 2); how to reproduce the full analyses presented in the paper, with running time up to several days, depending on the computational resources (Demo 3). The Demos 1 and 2 are aimed to be user-friendly starting points for understanding and testing how to implement CORAL. The Demo 3 is included mainly for reproducibility.
System requirements
· The software can be used in any operating system where R can be installed.
· We have developed and tested the software in a windows environment with R version 4.3.1.
· Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).
· The use of the software does not require any non-standard hardware.
Installation guide
· The CORAL functions are implemented in Hmsc (3.3-3). The software that applies the is presented as a R-pipeline and thus it does not require any installation other than installation of R.
Demo 1: Software demo with simulated data
The software demonstration consists of two R-markdown files:
· D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed to data analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset as it is assumed to remain unknown for the user), Y (species occurrence data), phy (phylogenetic tree), studyDesign (list of sampling units). Additionally, true values used for data generation are save in the file trueValuesDemo.RData: LF (the second environmental predictor that will be estimated through a latent factor approach), and beta (species responses to environmental predictors).
· D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models than baseline models, and the ability of CORAL to estimate the true values used for data generation.
Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from few seconds to at most one minute.
Demo 2: Software demo with a small subset of the data used in the paper
The software demonstration consists of one R-markdown file:
MA_small_demo. This script uses the CORAL functions in HMSC to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with prevalence at least 40 and less than 50, and common species as those with prevalence at least 200. This leaves 51 species to the backbone model and 460 rare species modelled through the CORAL approach. The script assess model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.
Scripts and data for reproducing the results presented in the paper (Demo 3)
The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of previous files: they must be run in order. The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:
· S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.
· S02_export_Hmsc_model - prepares the initial model for HPC sampling for fitting with Hmsc-HPC. Fitting of the model can be then done in an HPC environment with the bash file generated by the script. Computationally intensive.
· S03_import_posterior – imports the posterior distributions sampled by the initial model.
· S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.
· S05_visualize_backbone_model – check backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.
· S06_construct_coral_priors – calculate CORAL prior parameters.
The remaining scripts evaluate the model:
· S07_evaluate_prior_predictionss – use the CORAL prior to predict rare species presence/absences and evaluate the predictions in terms of AUC. Generates Fig. 3 of the paper.
· S08_make_training_test_split – generate train/test splits for cross-validation ensuring at least 40% of positive samples are in each partition.
· S09_cross-validate – fit CORAL and the baseline model to the train/test splits and calculate performance summaries. Note: we ran this once with the initial train/test split and then again with on the inverse split (i.e., training = ! training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.
· S10_show_cross-validation_results – Make plots visualizing AUC/Tjur’s R2 produced by cross-validation. Generates Fig. 4 of the paper.
· S11a_fit_coral_models – Fit the CORAL model to all 250k rare species. Computationally intensive.
· S11b_fit_baseline_models – Fit the baseline model to all 250k rare species. Computationally intensive.
· S12_compare_posterior_inference – compare posterior climate predictions using CORAL and baseline models on selected species, as well as variance reduction for all species. Generates Fig. 5 of the paper.
Pre-processing scripts:
· P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.
· P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global” and adds that to metadata.
· P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).
Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.
创建时间:
2025-03-03



