Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”
Source: DataCite Commons. Updated 2026-04-28; indexed 2024-07-13.
Download link:
https://www.osti.gov/servlets/purl/2318723
Description:
This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024).
River sediment respiration observations are expensive and labor intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results to support reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project.
One of the key challenges in this work was finding an optimal configuration for ML models working with this feature-rich data set (i.e., 100+ possible input variables). We used a two-tiered approach to manage the analysis of this complex data set: 1) a stacked ensemble of ML models that automatically optimizes hyperparameters to accelerate model selection and tuning, and 2) feature permutation importance (FPI) to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, making this implementation a potential template for other applications.
This data package is associated with a GitHub repository. Please see the file level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and a description of each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers in the comma separated value (csv) files in this data package and a description of each.
The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds the ML models trained on the data in “input_data”; (3) “scripts” contains the data preprocessing and postprocessing scripts, and their intermediate results, that are specific to this data set and bookend the ML workflow; (4) “examples” contains visualizations of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch.
Each trained ML model resides on its own branch in the repository, so inputs and outputs can differ from branch to branch. Depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts and their intermediate results can also differ from branch to branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details.
There is also one hidden directory, “.github/workflows”. It contains the information needed to run the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
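The feature permutation importance step mentioned in the description can be illustrated with a minimal, dependency-free sketch. This is not the code archived in the repository: the function names (`mse`, `permutation_importance`), the toy data, and the hand-written `toy_model` are all hypothetical and stand in for the trained stacked ensemble. The idea shown is only the general technique: shuffle one input column at a time and measure how much the prediction error grows.

```python
import random

def mse(predict, X, y):
    """Mean squared error of a predict(row) callable over a data set."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled.

    A larger increase means the model relies more on that feature.
    """
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(predict, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical toy data: three inputs, but the target ignores the third.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]
y = [3.0 * row[0] + 0.5 * row[1] for row in X]

def toy_model(row):
    """Stand-in for a trained model; here it matches the target exactly."""
    return 3.0 * row[0] + 0.5 * row[1]

scores = permutation_importance(toy_model, X, y)
# Rank feature indices from most to least important.
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(ranked, scores)
```

In an iterative selection loop of the kind the description outlines, the lowest-ranked features would be dropped and the model retrained on the remaining inputs before the importances are recomputed; the retraining step is omitted here because it depends on the actual ML workflow.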
Provider:
River Corridor and Watershed Biogeochemistry SFA
Created:
2024-03-23



