Prediction of 35 Target Per- and Polyfluoroalkyl Substances (PFASs) in California Groundwater Using Multilabel Semisupervised Machine Learning
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://figshare.com/articles/dataset/Prediction_of_35_Target_Per-_and_Polyfluoroalkyl_Substances_PFASs_in_California_Groundwater_Using_Multilabel_Semisupervised_Machine_Learning/23989137
Description:
Comprehensive monitoring of perfluoroalkyl and polyfluoroalkyl substances (PFASs) is challenging because of the high analytical cost and an increasing number of analytes. We developed a machine learning pipeline to understand environmental features influencing PFAS profiles in groundwater. By examining 23 public data sets (2016–2022) in California, we built a state-wide groundwater database (25,000 observations across 4,200 wells) encompassing contamination sources, weather, air quality, soil, hydrology, and groundwater quality (PFASs and cocontaminants). We used supervised learning to prescreen total PFAS concentrations above 70 ng/L and multilabel semisupervised learning to predict 35 individual PFAS concentrations above 2 ng/L. Random forest with ADASYN oversampling performed best for total PFASs (AUROC 99%). XGBoost with SMOTE oversampling achieved AUROCs of 73–100% for individual PFAS prediction. Contamination sources and soil variables contributed the most to accuracy. Individual PFASs were strongly correlated within their subfamilies (i.e., short- vs long-chain PFCAs, sulfonamides). Classifier chains, which predict each PFAS using the predictions already made for other species, exploited these associations to improve prediction performance. We applied the model to reconstruct PFAS profiles in groundwater wells with missing data in previous years. Our approach can complement monitoring programs of environmental agencies to validate previous investigation results and prioritize sites for future PFAS sampling.
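The classifier-chain idea in the description can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline. The base learner here is a toy 1-nearest-neighbor classifier standing in for XGBoost, and the oversampling and semisupervised steps are omitted. The two features, the three analytes, and every number are made up for illustration. Only the 2 ng/L cutoff comes from the description.

```python
THRESHOLD_NG_L = 2.0  # per-analyte cutoff quoted in the description


def to_labels(concentrations, threshold=THRESHOLD_NG_L):
    """Binarize one well's concentrations (ng/L): 1 if above the cutoff."""
    return [int(c > threshold) for c in concentrations]


class NearestNeighbor:
    """Toy 1-nearest-neighbor base learner (assumed stand-in, not XGBoost)."""

    def fit(self, X, y):
        self.X = [list(row) for row in X]
        self.y = list(y)
        return self

    def predict_one(self, x):
        def sq_dist(row):
            return sum((a - b) ** 2 for a, b in zip(row, x))

        best = min(range(len(self.X)), key=lambda i: sq_dist(self.X[i]))
        return self.y[best]


class ClassifierChain:
    """One binary model per label; model j also sees labels 0..j-1."""

    def __init__(self, n_labels):
        self.models = [NearestNeighbor() for _ in range(n_labels)]

    def fit(self, X, Y):
        for j, model in enumerate(self.models):
            # Training: append the TRUE values of the earlier labels.
            augmented = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            model.fit(augmented, [y[j] for y in Y])
        return self

    def predict_one(self, x):
        preds = []
        for model in self.models:
            # Inference: append the PREDICTED values of the earlier labels.
            preds.append(model.predict_one(list(x) + preds))
        return preds


# Hypothetical features per well: [distance to source (km), clay fraction].
X = [[0.5, 0.2], [0.8, 0.3], [5.0, 0.6], [6.0, 0.7]]
# Hypothetical concentrations (ng/L) for three illustrative analytes.
conc = [[12.0, 8.0, 0.5], [9.0, 3.1, 1.0], [0.4, 0.0, 0.0], [1.5, 0.2, 0.1]]
Y = [to_labels(c) for c in conc]

chain = ClassifierChain(n_labels=3).fit(X, Y)
print(chain.predict_one([0.6, 0.25]))  # -> [1, 1, 0]
print(chain.predict_one([5.2, 0.6]))   # -> [0, 0, 0]
```

Because each model in the chain receives the earlier labels as extra inputs, correlated analytes (for example, members of the same subfamily) can inform one another. The label order is a design choice that this sketch fixes arbitrarily.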
Created:
2023-08-18



