Probability distribution grids of dissolved oxygen and dissolved manganese concentrations at selected thresholds in drinking water depth zones, Central Valley, California

DataONE. Updated 2018-01-27. Indexed 2024-06-25.

Download link:

https://search.dataone.org/view/a905afa4-cdf2-4f19-ac0d-42423de2d684

Resource description:

The ASCII grids represent regional probabilities that groundwater at a given location will have dissolved oxygen (DO) concentrations below selected threshold values representing anoxic groundwater conditions, or dissolved manganese (Mn) concentrations above selected threshold values representing secondary maximum contaminant levels (SMCL) and health-based screening levels (HBSL) for drinking water. The probability models were constrained by the alluvial boundary of the Central Valley to a depth of approximately 300 meters (m). We used boosted regression trees (BRT) with a Bernoulli error distribution, within a statistical learning framework implemented in R (http://www.r-project.org/), to produce two-dimensional probability grids at selected depths throughout the modeling domain. The statistical learning framework seeks to maximize the predictive performance of machine learning methods through model tuning by cross-validation.

Models were constructed from measured DO and Mn concentrations in samples from 2,767 wells within the alluvial boundary of the Central Valley and more than 60 predictor variables from 7 sources (see metadata). Together, the predictors represent regional-scale soil properties, soil chemistry, land use, aquifer textures, and aquifer hydrology. Previously developed Central Valley model outputs of textures (Central Valley Textural Model, CVTM; Faunt and others, 2010) and of MODFLOW-simulated vertical water fluxes and predicted depth to water table (Central Valley Hydrologic Model, CVHM; Faunt, 2009) were used to represent aquifer textures and groundwater hydraulics, respectively. Predictor variable values were attributed to the wells used in the BRT models in ArcGIS using a 500-m buffer.
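
To make the phrase "boosted regression trees with a Bernoulli error distribution" more concrete, the following is a minimal sketch of the idea in plain Python. It is not the authors' R code: it uses a single hypothetical predictor (well depth), depth-1 trees (stumps), made-up observations, and a simplified gradient step, and the function names, `n_trees`, and `learning_rate` are illustrative assumptions. Each stump is fitted to the residuals (observed 0/1 response minus the current predicted probability), which is the negative gradient of the Bernoulli deviance.

```python
import math


def sigmoid(z):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Least-squares regression stump: one split on a single predictor."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= cut]
        right = [r for x, r in zip(xs, residuals) if x > cut]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, cut, left_mean, right_mean)
    return best[1], best[2], best[3]


def fit_boosted_stumps(xs, ys, n_trees=100, learning_rate=0.1):
    """Boost stumps on the log-odds scale under a Bernoulli (0/1) response."""
    base_rate = sum(ys) / len(ys)
    intercept = math.log(base_rate / (1.0 - base_rate))
    scores = [intercept] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Negative gradient of the Bernoulli deviance: y - p.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        cut, left_val, right_val = fit_stump(xs, residuals)
        stumps.append((cut, left_val, right_val))
        scores = [s + learning_rate * (left_val if x <= cut else right_val)
                  for x, s in zip(xs, scores)]
    return {"intercept": intercept, "stumps": stumps,
            "learning_rate": learning_rate}


def predict_probability(model, x):
    """Predicted probability of the event at predictor value x."""
    score = model["intercept"]
    for cut, left_val, right_val in model["stumps"]:
        score += model["learning_rate"] * (left_val if x <= cut else right_val)
    return sigmoid(score)


# Hypothetical data: well depth (m) and whether DO fell below a threshold.
well_depths_m = [10, 20, 30, 40, 60, 80, 100, 150, 200, 250]
low_do = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

model = fit_boosted_stumps(well_depths_m, low_do)
shallow_p = predict_probability(model, 15)
deep_p = predict_probability(model, 220)
```

In a tuning workflow like the one described above, settings such as the number of trees and the learning rate would be chosen by cross-validation rather than fixed in advance as they are here.
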
The response variable data consisted of measured DO and Mn concentrations from 2,767 wells within the alluvial boundary of the Central Valley. The data were compiled from two sources: the U.S. Geological Survey (USGS) National Water Information System (NWIS) database (all data are publicly available from the USGS at http://waterdata.usgs.gov/ca/nwis/nwis) and the California State Water Resources Control Board Division of Drinking Water (SWRCB-DDW) database (water-quality data are publicly available from the SWRCB at http://geotracker.waterboards.ca.gov/gama/). Only wells with well depth data were selected; for wells with multiple records, only the most recent sample in the period 1993–2014 that had the required water-quality data was used. Data were available for 932 wells in the NWIS dataset and 1,835 wells in the SWRCB-DDW dataset.

Models were trained on the NWIS dataset (932 wells) and evaluated on the SWRCB-DDW dataset (1,835 wells) as an independent hold-out dataset. We used cross-validation to assess the predictive performance of models of varying complexity as a basis for selecting the final models used to create the prediction grids. Trained models were applied to cross-validation testing data and to the separate hold-out dataset, and predictive performance was evaluated with three metrics of model fit: Kappa, accuracy, and the area under the receiver operating characteristic (ROC) curve. The final trained models were used to map predictions at discrete depths to a depth of approximately 300 m.

Trained DO and Mn models had accuracies of 86–100 percent, Kappa values of 0.69–0.99, and ROC values of 0.92–1.0. For the cross-validation testing datasets, accuracies were 82–95 percent and ROC values were 0.87–0.91, indicating good predictive performance; Kappa values were 0.30–0.69, indicating fair to substantial agreement between testing observations and model predictions.
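
The three fit metrics named above can be computed from first principles, which also shows why accuracy can be high while Kappa stays low when the event of interest is rare. The sketch below is illustrative only: the confusion-matrix counts and scores are invented, and the function names are assumptions rather than anything taken from the study.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of wells classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Agreement beyond what chance alone would produce.

    Undefined (division by zero) if expected chance agreement equals 1.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1.0 - expected)


def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve as the probability that a randomly chosen
    event well receives a higher score than a non-event well (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical hold-out result: 100 wells, only 5 with the rare event.
tp, fp, fn, tn = 1, 2, 4, 93
acc = accuracy(tp, fp, fn, tn)
kappa = cohen_kappa(tp, fp, fn, tn)

# Hypothetical predicted probabilities for event and non-event wells.
auc = roc_auc([0.9, 0.4], [0.1, 0.5, 0.3])
```

With these invented counts the classifier is right for 94 of 100 wells, yet Kappa is only about 0.22, because always predicting "no event" would already be right 95 percent of the time.
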
Hold-out data were available for the manganese model only and indicated accuracies of 89–97 percent, ROC values of 0.73–0.75, and Kappa values of 0.06–0.30. The predictive performance of both the DO and Mn models was reasonable, considering all three fit metrics and the low percentages of low-DO and high-Mn events in the data. See the associated journal article (Rosecrans and others, 2017) for a complete summary of BRT modeling methods, model fit metrics, and the relative influence of predictor variables for a given DO or Mn BRT model.

The modeled response variables for the DO BRT models were based on measured DO values from wells at the following thresholds: <0.5 milligrams per liter (mg/L), <1.0 mg/L, and <2.0 mg/L. These threshold values were considered to indicate anoxic conditions based on literature reviews. The modeled response variables for the Mn BRT models were based on measured Mn values from wells at the following exceedance thresholds: >50 micrograms per liter (µg/L), >150 µg/L, and >300 µg/L. (The 150 µg/L manganese threshold represents one-half the USGS HBSL.)

The prediction grids were discretized below land surface in 15-m intervals to a depth of 122 m, followed by 30-m intervals to a depth of 300 m, resulting in 14 two-dimensional probability grids for each constituent (DO and Mn) and threshold. Probability grid maps were also created for the shallow aquifer and deep aquifer, represented by the median domestic and public-supply well depths, respectively. A depth of 46 m was used to stratify wells from the training dataset into the shallow and deep aquifer; it was derived from depth percentiles associated with domestic and public-supply wells in previous work by Burow and others (2013). In this work, the median depth of wells categorized as domestic was 30 m below land surface (bls), and the median depth of wells categorized as public supply was 100 m bls.
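
The thresholds above turn each measured concentration into a set of binary responses, one per threshold model. A minimal sketch of that labeling step follows; the function and constant names are hypothetical, the sample concentrations are made up, and the strict inequalities follow the "<" and ">" signs in the description.

```python
# Thresholds as listed in the description.
DO_THRESHOLDS_MG_L = (0.5, 1.0, 2.0)       # event: DO below threshold
MN_THRESHOLDS_UG_L = (50.0, 150.0, 300.0)  # event: Mn above threshold


def do_responses(do_mg_l):
    """Binary response (1 = low-DO event) for each DO threshold."""
    return {t: int(do_mg_l < t) for t in DO_THRESHOLDS_MG_L}


def mn_responses(mn_ug_l):
    """Binary response (1 = high-Mn event) for each Mn threshold."""
    return {t: int(mn_ug_l > t) for t in MN_THRESHOLDS_UG_L}


# Hypothetical measurements from a single well.
example_do = do_responses(0.8)    # below 1.0 and 2.0 mg/L, not below 0.5 mg/L
example_mn = mn_responses(120.0)  # above 50 µg/L only
```

Each of the six resulting response columns would then be modeled separately, which matches the six BRT models referred to in the description.
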
Therefore, the folders named "DO BRT prediction grids.zip" and "Mn BRT prediction grids.zip" each contain 42 probability grids (14 depths for each of the 3 thresholds) from the DO and Mn BRT models described above. The folder named "PublicSupply&DomesticGrids.zip" contains probability grids at the domestic and public-supply drinking water depths for each of the six BRT models described above (6 models at 2 depths, 12 grids total).
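
For readers who want to inspect one of the probability grids, the sketch below reads a small grid in plain Python. It assumes the files follow the common Esri ASCII raster layout (header lines such as `ncols`, `nrows`, `cellsize`, and `NODATA_value`, followed by one grid row per text line); the description does not state the exact layout, so check a file before relying on this. The sample values are invented.

```python
import io

HEADER_KEYS = {"ncols", "nrows", "xllcorner", "yllcorner",
               "xllcenter", "yllcenter", "cellsize", "nodata_value"}


def read_ascii_grid(stream):
    """Parse an Esri-style ASCII grid into (header dict, list of rows).

    Cells equal to the NODATA value are returned as None.
    Assumes each text line after the header holds exactly one grid row.
    """
    header = {}
    rows = []
    for line in stream:
        parts = line.split()
        if not parts:
            continue
        key = parts[0].lower()
        if key in HEADER_KEYS:
            header[key] = float(parts[1])
        else:
            rows.append([float(v) for v in parts])
    nodata = header.get("nodata_value")
    grid = [[None if v == nodata else v for v in row] for row in rows]
    n_rows, n_cols = int(header["nrows"]), int(header["ncols"])
    if len(grid) != n_rows or any(len(r) != n_cols for r in grid):
        raise ValueError("grid shape does not match header")
    return header, grid


# Hypothetical 2 x 3 grid of probabilities with one missing cell.
sample = io.StringIO(
    "ncols 3\n"
    "nrows 2\n"
    "xllcorner 0\n"
    "yllcorner 0\n"
    "cellsize 1000\n"
    "NODATA_value -9999\n"
    "0.12 0.40 -9999\n"
    "0.75 0.05 0.33\n"
)
header, grid = read_ascii_grid(sample)
```

For a real file, replace the `io.StringIO` sample with `open(path)` on a grid extracted from one of the zip archives.
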


Created:

2018-02-01