Models, data, and scripts associated with “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning”

DataONE · Updated 2024-03-08 · Indexed 2024-06-08

Download link:

https://search.dataone.org/view/ess-dive-304e105a97978d2-20240308T160806276

Resource description:

This data package is associated with the publication “Prediction of Distributed River Sediment Respiration Rates using Community-Generated Data and Machine Learning” submitted to the Journal of Geophysical Research: Machine Learning and Computation (Scheibe et al. 2024). River sediment respiration observations are expensive and labor-intensive to obtain, and there is no physical model for predicting this quantity. The Worldwide Hydrobiogeochemistry Observation Network for Dynamic River Systems (WHONDRS) observational data set (Goldman et al., 2020) is used to train machine learning (ML) models to predict respiration rates at unsampled sites. This repository archives training data, ML models, predictions, and model evaluation results to support reproducibility of the results in the associated manuscript and community reuse of the ML models trained in this project. One of the key challenges in this work was finding an optimal configuration for ML models working with this feature-rich (i.e., 100+ possible input variables) data set. We used a two-tiered approach to manage the analysis of this complex data set: 1) a stacked ensemble of ML models that automatically optimizes hyperparameters to accelerate model selection and tuning, and 2) feature permutation importance (FPI) to iteratively select the most important features (i.e., inputs) to the ML models. The major elements of this ML workflow are modular, portable, open, and cloud-based, making this implementation a potential template for other applications. This data package is associated with the GitHub repository found at https://github.com/parallelworks/sl-archive-whondrs. A static copy of the GitHub repository is included in this data package as an archived version at the time of publishing this data package (March 2023). However, we recommend accessing these files via GitHub for full functionality.
Please see the file-level metadata (flmd; “sl-archive-whondrs_flmd.csv”) for a list of all files contained in this data package and a description of each. Please see the data dictionary (dd; “sl-archive-whondrs_dd.csv”) for a list of all column headers contained within the comma-separated value (csv) files in this data package and a description of each. The GitHub repository is organized into five top-level directories: (1) “input_data” holds the training data for the ML models; (2) “ml_models” holds the ML models trained on the data in “input_data”; (3) “scripts” contains data preprocessing and postprocessing scripts, and intermediate results specific to this data set, that bookend the ML workflow; (4) “examples” contains visualizations of the results in this repository, including plotting scripts for the manuscript (e.g., model evaluation, FPI results) and scripts for running predictions with the ML models (i.e., reusing the trained ML models); (5) “output_data” holds the overall results of the ML model on that branch. Each trained ML model resides on its own branch in the repository, so inputs and outputs can differ from branch to branch. Furthermore, depending on the number of features used to train the ML models, the preprocessing and postprocessing scripts and their intermediate results can also differ from branch to branch. The “main-*” branches are meant to be starting points (i.e., trunks) for each model branch (i.e., sprouts). Please see the Branch Navigation section in the top-level README.md in the GitHub repository for more details. There is also one hidden directory, “.github/workflows”. This hidden directory contains information on how to run the ML workflow as an end-to-end automated GitHub Action, but it is not needed for reusing the ML models archived here. Please see the top-level README.md in the GitHub repository for more details on the automation.
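
The two-tiered approach described above (a stacked ensemble with automatic hyperparameter tuning, followed by iterative feature selection via permutation importance) can be illustrated with a minimal sketch. This is not the code archived in the repository: it assumes scikit-learn as the ML library, uses synthetic stand-in data rather than WHONDRS observations, and every name in it (`build_stack`, `n_keep`, `kept`) is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT WHONDRS data): 6 candidate features,
# of which only features 0 and 1 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def build_stack():
    """Tier 1: stacked ensemble whose base learners tune themselves."""
    # Each base learner is wrapped in a small grid search, so its
    # hyperparameters are re-selected every time the stack is fitted.
    ridge = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3)
    forest = GridSearchCV(
        RandomForestRegressor(n_estimators=30, random_state=0),
        {"max_depth": [3, None]},
        cv=3,
    )
    # A linear meta-learner combines the base learners' predictions.
    return StackingRegressor(
        estimators=[("ridge", ridge), ("forest", forest)],
        final_estimator=LinearRegression(),
        cv=3,
    )


# Tier 2: repeatedly fit, rank features by permutation importance on
# held-out data, and drop the least important one until n_keep remain.
n_keep = 2  # hypothetical stopping point chosen for this toy problem
kept = list(range(X.shape[1]))
while len(kept) > n_keep:
    model = build_stack().fit(X_train[:, kept], y_train)
    fpi = permutation_importance(
        model, X_test[:, kept], y_test, n_repeats=5, random_state=0
    )
    kept.pop(int(np.argmin(fpi.importances_mean)))

# Refit on the surviving features and score on the held-out split.
final_model = build_stack().fit(X_train[:, kept], y_train)
final_score = final_model.score(X_test[:, kept], y_test)
print("kept feature indices:", kept)
print("hold-out R^2:", round(final_score, 3))
```

Dropping one feature per iteration keeps the sketch simple; with 100+ candidate inputs, a real workflow would likely remove features in larger batches or stop on a performance criterion instead of a fixed count.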

Created:

2024-03-12
