DataSheet1_Interpretability Versus Accuracy: A Comparison of Machine Learning Models Built Using Different Algorithms, Performance Measures, and Features to Predict E. coli Levels in Agricultural Water.docx
Source: frontiersin.figshare.com. Updated 2023-06-10. Indexed 2025-01-22.
Download link:
https://frontiersin.figshare.com/articles/dataset/DataSheet1_Interpretability_Versus_Accuracy_A_Comparison_of_Machine_Learning_Models_Built_Using_Different_Algorithms_Performance_Measures_and_Features_to_Predict_E_coli_Levels_in_Agricultural_Water_docx/14595339/1
Description:
Since E. coli is considered a fecal indicator in surface water, government water quality standards and industry guidance often rely on E. coli monitoring to identify when there is an increased risk of pathogen contamination of water used for produce production (e.g., for irrigation). However, studies have indicated that E. coli testing can present an economic burden to growers and that time lags between sampling and obtaining results may reduce the utility of these data. Models that predict E. coli levels in agricultural water may provide a mechanism for overcoming these obstacles. Thus, this proof-of-concept study uses previously published datasets to train, test, and compare E. coli predictive models using multiple algorithms and performance measures. Since the collection of different feature data carries specific costs for growers, predictive performance was compared for models built using different feature types [geospatial, water quality, stream traits, and/or weather features]. Model performance was assessed against baseline regression models. Model performance varied considerably with root-mean-squared errors and Kendall’s Tau ranging between 0.37 and 1.03, and 0.07 and 0.55, respectively. Overall, models that included turbidity, rain, and temperature outperformed all other models regardless of the algorithm used. Turbidity and weather factors were also found to drive model accuracy even when other feature types were included in the model. These findings confirm previous conclusions that machine learning models may be useful for predicting when, where, and at what level E. coli (and associated hazards) are likely to be present in preharvest agricultural water sources. This study also identifies specific algorithm-predictor combinations that should be the foci of future efforts to develop deployable models (i.e., models that can be used to guide on-farm decision-making and risk mitigation). When deploying E. coli predictive models in the field, it is important to note that past research indicates an inconsistent relationship between E. coli levels and foodborne pathogen presence. Thus, models that predict E. coli levels in agricultural water may be useful for assessing fecal contamination status and ensuring compliance with regulations but should not be used to assess the risk that specific pathogens of concern (e.g., Salmonella, Listeria) are present.
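The two performance measures named in the description, root-mean-squared error and Kendall's Tau, can be illustrated with a minimal Python sketch. The function names `rmse` and `kendall_tau_b` and the sample values are hypothetical and are not taken from the dataset. The description does not say which variant of Kendall's Tau the study used, so the tau-b form (which adjusts for ties) is an assumption here.

```python
import math
from itertools import combinations


def rmse(observed, predicted):
    """Root-mean-squared error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation (assumed variant; adjusts for ties)."""
    if len(x) != len(y):
        raise ValueError("inputs must be of equal length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither tie term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical observed and predicted E. coli levels (illustrative only).
observed = [1.2, 2.0, 2.5, 3.1]
predicted = [1.0, 2.4, 2.2, 3.0]

print(f"RMSE: {rmse(observed, predicted):.3f}")
print(f"Kendall's tau-b: {kendall_tau_b(observed, predicted):.3f}")
```

A lower RMSE means predictions are closer to the observed values, while a Kendall's Tau closer to 1 means the model ranks samples in nearly the same order as the observations. The two measures can therefore disagree about which model is better.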
Provider:
Frontiers



