Baltic Sea algal bloom prediction: multiple Regression Analysis
Source: DataCite Commons. Updated: 2025-10-30. Indexed: 2026-05-04.
Download link:
https://mostwiedzy.pl/en/open-research-data/baltic-sea-algal-bloom-prediction-multiple-regression-analysis,1030040507107347-0
Resource description:
This dataset contains a comprehensive Jupyter notebook implementing multiple regression analysis for predicting algal bloom occurrence in the Baltic Sea based on environmental parameters. The notebook integrates multi-source oceanographic data including chlorophyll concentrations, nutrient levels (phosphates, nitrates, ammonia), and sea surface temperature measurements to develop predictive models for bloom risk assessment at coastal bathing sites.
The analysis pipeline includes:
Data integration from multiple marine databases (Copernicus Marine Service, EMODnet, EEA)
Geographic coordinate matching with tolerance-based spatial interpolation
Statistical modeling using scikit-learn and statsmodels libraries
Comprehensive model diagnostics including VIF analysis, residual testing, and validation metrics
Advanced spatial visualization with risk zone mapping using Cartopy
Interactive risk assessment for bathing water quality monitoring
The notebook demonstrates reproducible workflows for environmental data science, suitable for educational purposes, policy-making support, and further research in marine ecology and coastal management. All analysis steps are documented with detailed explanations in English, making the notebook accessible to the international scientific community.
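The tolerance-based coordinate matching step in the pipeline could look roughly like the sketch below. It is not taken from the notebook: the function name `match_within_tolerance`, the column names (`site`, `lat`, `lon`, `chl`), and the sample values are all hypothetical, and the ±0.1° window is the illustrative tolerance this record mentions. Each site simply receives the mean of all observations that fall inside a square window around it.

```python
import numpy as np
import pandas as pd

def match_within_tolerance(sites, obs, tol=0.1, value_cols=("chl",)):
    """Average the observations whose lat and lon both lie within +/- tol degrees of each site."""
    rows = []
    for site in sites.itertuples(index=False):
        near = obs[
            (np.abs(obs["lat"] - site.lat) <= tol)
            & (np.abs(obs["lon"] - site.lon) <= tol)
        ]
        means = near[list(value_cols)].mean()  # NaN when no observation is in range
        rows.append({"site": site.site, "n_obs": len(near), **means.to_dict()})
    return pd.DataFrame(rows)

# Made-up coordinates and values, for illustration only.
sites = pd.DataFrame({"site": ["A", "B"], "lat": [54.5, 55.0], "lon": [18.5, 20.0]})
obs = pd.DataFrame({
    "lat": [54.45, 54.58, 56.0],
    "lon": [18.55, 18.42, 21.0],
    "chl": [2.0, 4.0, 10.0],
})
matched = match_within_tolerance(sites, obs)
print(matched)
```

A square window in degrees is not a constant distance on the ground, because a degree of longitude shrinks with latitude. A distance-based match may be preferable where that matters.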
Regression analysis methodology:
Model type: Ordinary Least Squares (OLS) multiple linear regression
Dependent variable: Chlorophyll-a concentration (μg/L) as algal bloom indicator
Independent variables: Phosphate (PO4), nitrate (NO3), ammonia (NH4) concentrations, and sea surface temperature (SST)
Data preprocessing: Standardization (Z-score normalization) of predictor variables, missing value removal, time-averaged aggregation for multi-temporal measurements
Model validation: 80/20 train-test split with performance evaluation using R², RMSE, and MAE metrics
Multicollinearity assessment: Variance Inflation Factor (VIF) analysis to detect correlation among predictors
Statistical inference: P-value testing for variable significance (α=0.05 threshold)
Diagnostic procedures: Residual analysis (normality testing via Q-Q plots, homoscedasticity inspection), predicted vs. actual value comparison
Spatial analysis: Cubic spline interpolation for generating continuous risk surfaces from point measurements, contour-based risk zone delineation
Technical specifications:
Programming language: Python 3
Key libraries: pandas, scikit-learn, statsmodels, cartopy, seaborn
Input data format: CSV/TSV with geographic coordinates
Output: Statistical models, diagnostic plots, risk maps, and CSV reports
Computational approach: Memory-efficient chunked processing for large datasets
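One common way to combine chunked reading with the time-averaged aggregation described in the preprocessing step is to accumulate per-location sums and counts chunk by chunk and divide at the end. The sketch below assumes that approach; the column names, the tiny inline CSV, and the chunk size are hypothetical and only keep the example self-contained.

```python
import io
import pandas as pd

# Tiny inline stand-in for a large CSV file; values are made up.
csv_text = """lat,lon,time,chl
54.5,18.5,2024-06-01,2.0
54.5,18.5,2024-07-01,4.0
55.0,20.0,2024-06-01,1.0
55.0,20.0,2024-07-01,3.0
54.5,18.5,2024-08-01,6.0
"""

partials = []
# chunksize bounds memory use: only this many rows are held at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk = chunk.dropna(subset=["chl"])  # missing value removal
    partials.append(chunk.groupby(["lat", "lon"])["chl"].agg(["sum", "count"]))

# Combine the partial sums and counts, then compute the mean per location.
totals = pd.concat(partials).groupby(level=["lat", "lon"]).sum()
mean_chl = (totals["sum"] / totals["count"]).rename("chl_mean").reset_index()
print(mean_chl)
```

For a TSV input, the same loop works with `sep="\t"` passed to `pd.read_csv`. Averaging the per-chunk means directly would be wrong whenever chunks contain different numbers of rows per location, which is why sums and counts are carried separately.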
Limitations and scope:
This is a demonstration of regression analysis methodology. Geographic tolerance values (±0.1°), risk classification thresholds, and other parameters used in this notebook are set for illustrative purposes and are not based on peer-reviewed scientific evidence or validated ecological standards. Users must not apply this analysis for operational decision-making without appropriate modification of parameters based on domain expertise, local environmental conditions, and scientifically validated thresholds for their specific use case.
The notebook is specifically designed for Galaxy workflow integration and expects data in a particular JSON input format (galaxy_inputs.json)
Analysis parameters and thresholds are calibrated for Baltic Sea environmental conditions and may require adjustment for other marine regions
Spatial interpolation assumes relatively uniform data distribution; sparse datasets may produce unreliable risk zone boundaries
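The sensitivity of interpolated risk surfaces to station coverage can be illustrated with a small sketch. It uses `scipy.interpolate.griddata` with `method="cubic"`, a piecewise-cubic scheme standing in for whatever spline routine the notebook actually uses. The station positions, the chlorophyll values, and the two risk thresholds are synthetic and hypothetical.

```python
import numpy as np
from scipy.interpolate import griddata

# Synthetic station positions and values; illustrative only.
rng = np.random.default_rng(1)
lon = rng.uniform(14.0, 22.0, 60)
lat = rng.uniform(54.0, 58.0, 60)
chl = 2.0 + 0.5 * (lon - 14.0) + 0.3 * (lat - 54.0)

# Regular 0.1-degree grid covering the same bounding box.
grid_lon, grid_lat = np.meshgrid(np.linspace(14.0, 22.0, 81), np.linspace(54.0, 58.0, 41))

# Cubic interpolation; cells outside the convex hull of the stations come back as NaN.
surface = griddata((lon, lat), chl, (grid_lon, grid_lat), method="cubic")

# Hypothetical risk thresholds (ug/L): 0 = low, 1 = medium, 2 = high, -1 = no data.
bins = [3.0, 5.0]
valid = ~np.isnan(surface)
risk = np.full(surface.shape, -1)
risk[valid] = np.digitize(surface[valid], bins)

print("grid cells without data:", int((~valid).sum()))
```

Cells marked `-1` lie outside the area the stations enclose. With fewer or more clustered stations that region grows, and the boundaries between risk classes inside the hull rest on fewer measurements.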
Provider:
Gdańsk University of Technology
Created:
2025-10-30



