PHYSPROPNET: A Benchmarking Study of Machine Learning Models for Physicochemical Property Prediction in Data-Limited Environmental Research
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/PHYSPROPNET_A_Benchmarking_Study_of_Machine_Learning_Models_for_Physicochemical_Property_Prediction_in_Data-Limited_Environmental_Research/30633599
Description:
Machine learning is increasingly applied in environmental chemistry
for contaminant screening and property prediction, yet consistent
benchmarks are lacking. We compared eight graph neural networks (GNNs)
and nine conventional learning algorithms combined with five molecular
descriptor and fingerprint sets across 11 physicochemical and environmental
fate properties from the U.S. EPA PHYSPROP database. Data set sizes
ranged from 10,652 compounds for LogP to 150 for half-life (LogHL),
with models evaluated under both random and scaffold splits. Model accuracy
depended primarily on data set size and molecular representation.
For end points containing fewer than ∼1,000 compounds, descriptor-based
models using RDKit or Mordred features with LightGBM, CatBoost, or
Random Forest matched or outperformed GNNs while requiring less training
time. For larger data sets, GNNs achieved comparable or higher accuracy.
A composite ranking that integrated accuracy, error, and computational
cost identified LightGBM with RDKit descriptors as the most effective
overall configuration. Feature attribution analyses confirmed that
both descriptor and graph models captured chemically interpretable
structure–property relationships. These results provide practical
guidance, showing that descriptor models are best suited for small
to moderate data sets or applications requiring transparency and high
throughput, whereas graph models become advantageous as data scale
increases or when richer molecular context is needed.
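
The composite ranking mentioned above combines accuracy, error, and computational cost, but the description does not state how the three were weighted. The following is a minimal sketch of one plausible scheme, assuming a weighted average of per-metric ranks. The function names, the metric keys (`r2`, `rmse`, `train_seconds`), the equal default weights, and the numbers in `example` are all hypothetical placeholders, not values or methods taken from the study.

```python
def rank(values, higher_is_better):
    """Return 1-based ranks (1 = best); tied values share their average rank."""
    order = sorted(values, reverse=higher_is_better)
    ranks = []
    for v in values:
        first = order.index(v) + 1   # best position held by this value
        count = order.count(v)       # number of entries tied at this value
        ranks.append(first + (count - 1) / 2)
    return ranks


def composite_ranking(results, weights=(1.0, 1.0, 1.0)):
    """Order configurations by a weighted mean of their per-metric ranks.

    results: {name: {"r2": float, "rmse": float, "train_seconds": float}}
    weights: (accuracy, error, cost) weights; equal weighting is an assumption.
    Returns a list of (name, score) pairs, lowest (best) score first.
    """
    names = list(results)
    acc = rank([results[n]["r2"] for n in names], higher_is_better=True)
    err = rank([results[n]["rmse"] for n in names], higher_is_better=False)
    cost = rank([results[n]["train_seconds"] for n in names], higher_is_better=False)
    w_acc, w_err, w_cost = weights
    total = w_acc + w_err + w_cost
    scores = {
        n: (w_acc * a + w_err * e + w_cost * c) / total
        for n, a, e, c in zip(names, acc, err, cost)
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up placeholder metrics for three hypothetical configurations.
example = {
    "config_a": {"r2": 0.80, "rmse": 0.50, "train_seconds": 12.0},
    "config_b": {"r2": 0.85, "rmse": 0.45, "train_seconds": 900.0},
    "config_c": {"r2": 0.70, "rmse": 0.65, "train_seconds": 5.0},
}
ranking = composite_ranking(example)
```

With equal weights, the slow but accurate `config_b` still ranks first in this toy input, because it leads on two of the three metrics. Raising the cost weight shifts the ordering toward faster configurations, which is why the weighting choice should be reported alongside any such ranking.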
Created:
2025-11-17



