Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants’ Activities and Properties
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://figshare.com/articles/dataset/Count-Based_Morgan_Fingerprint_A_More_Efficient_and_Interpretable_Molecular_Representation_in_Developing_Machine_Learning-Based_Predictive_Regression_Models_for_Water_Contaminants_Activities_and_Properties/23631610
Description:
In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent the chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only encodes the presence or absence of an atom group but also quantifies its count in a molecule. We employ six ML algorithms (ridge regression, support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), XGBoost, and CatBoost) to build models on 10 contaminant-related data sets with both C-MF and B-MF, and compare the two representations in terms of predictive performance, interpretability, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of the 10 data sets in terms of predictive performance. The advantage of C-MF over B-MF depends on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of the data sets as calculated by B-MF and C-MF. Model interpretation results show that C-MF-based models can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we developed a “ContaminaNET” platform to deploy these C-MF-based models for free use.
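The difference between the two representations can be illustrated with a minimal sketch. It is not the authors' code: the substructure IDs, the 16-position vector length, and the two example "molecules" are hypothetical, and the generalized Tanimoto similarity is used here only as one plausible way to compare fingerprints, since the description does not state which diversity measure was applied. In practice a cheminformatics toolkit such as RDKit would enumerate the atom environments; here they are supplied by hand.

```python
from collections import Counter

N_BITS = 16  # toy length for readability; an assumption, not a value from the study


def count_fp(substructure_ids, n_bits=N_BITS):
    """Count-based vector: position i holds how many times an ID folds onto i."""
    counts = Counter(sid % n_bits for sid in substructure_ids)
    return [counts.get(i, 0) for i in range(n_bits)]


def binary_fp(substructure_ids, n_bits=N_BITS):
    """Binary vector: position i is 1 if at least one ID folds onto i."""
    return [1 if c > 0 else 0 for c in count_fp(substructure_ids, n_bits)]


def tanimoto(a, b):
    """Generalized (min/max) Tanimoto; equals the usual form for 0/1 vectors."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 1.0


# Hypothetical atom-group IDs: 3 = chlorine-bearing aromatic carbon, 7 = aromatic CH.
mono_chloro = [3, 7, 7, 7, 7, 7]  # one substituted position, five unsubstituted
tri_chloro = [3, 3, 3, 7, 7, 7]   # three substituted positions, three unsubstituted

# The binary vectors are identical, so the two structures look the same.
print(tanimoto(binary_fp(mono_chloro), binary_fp(tri_chloro)))  # 1.0

# The count vectors differ, so the structures are distinguished.
print(tanimoto(count_fp(mono_chloro), count_fp(tri_chloro)))    # 0.5
```

In this toy case the binary representation cannot tell the two structures apart, while the count-based one can, which is the kind of information a regression model could use when the target depends on how many times an atom group occurs.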
Created:
2023-07-05



