Combining Group Contribution Method and Semisupervised Learning to Build Machine Learning Models for Predicting Hydroxyl Radical Rate Constants of Water Contaminants
Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://figshare.com/articles/dataset/Combining_Group_Contribution_Method_and_Semisupervised_Learning_to_Build_Machine_Learning_Models_for_Predicting_Hydroxyl_Radical_Rate_Constants_of_Water_Contaminants/28095176
Description:
Machine learning is an effective tool for predicting the rate constants of reactions between organic compounds and the hydroxyl radical (HO•). Previously reported models have achieved relatively good performance, but because data are scarce (<1400 records), their applicability domain (AD) has been significantly limited. To address this limitation, we curated a much larger experimental data set (the Primary data set), which contains 2358 kinetic records. We then employed both the group contribution method (GCM) and a semisupervised learning (SSL) strategy to add new data points, aiming to expand the model's AD while improving model performance. The results indicated that GCM improved the model's performance for chemicals outside the AD, while SSL expanded the model's AD. The final model, after incorporating 147,168 new data points, achieved R² = 0.77, root-mean-square error (RMSE) = 0.32, and mean absolute error (MAE) = 0.24 on the test set. Importantly, the AD was expanded by 117% compared to the model developed solely on the Primary data set, and the final model can be reliably applied to more than 560,000 chemicals from the DSSTox database. Further model interpretation results indicated that the model made predictions based on a correct "understanding" of the impact of key substituents and reactive sites on reactivity toward HO•. This research provides an effective method for augmenting data sets, which is important for improving ML model performance and expanding the AD. The final model has been made widely accessible through a free online predictor.
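
The three test-set metrics quoted in the description (R², root-mean-square error, and mean absolute error) can be reproduced for any set of predictions with a few lines of code. The following is a minimal sketch, not the authors' evaluation script: the function name `regression_metrics` and the sample values are hypothetical, and the sample values are made up for illustration (the description does not state the scale on which the rate constants are modeled).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired observed/predicted values.

    Hypothetical helper for illustration; not taken from the data set.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are identical")

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Made-up observed and predicted values, for illustration only.
y_true = [9.0, 9.5, 10.0, 8.5]
y_pred = [9.2, 9.4, 9.7, 8.8]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE penalizes large individual errors more heavily than MAE, so RMSE is always at least as large as MAE; the reported pair (0.32 and 0.24) is consistent with that relationship.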
Created:
2024-12-26



