Improving Machine Learning Classification Predictions through SHAP and Features Analysis Interpretation
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Improving_Machine_Learning_Classification_Predictions_through_SHAP_and_Features_Analysis_Interpretation/30399883
Description:
Tree-based machine learning (ML) algorithms, such as Extra Trees
(ET), Random Forest (RF), Gradient Boosting Machine (GBM), and XGBoost
(XGB) are among the most widely used in early drug discovery, given
their versatility and performance. However, models based on these
algorithms often suffer from misclassification and limited interpretability,
which limit their applicability in practice. To address these
challenges, several approaches have been proposed, including the use
of SHapley Additive Explanations (SHAP). While SHAP values are commonly
used to elucidate the importance of features driving models’
predictions, they can also be employed in strategies to improve their
prediction performance. Building on these premises, we propose a novel
approach that integrates SHAP and feature value analyses to reduce
misclassification in model predictions. Specifically, we benchmarked
classifiers based on ET, RF, GBM, and XGB algorithms using data sets
of compounds with known antiproliferative activity against three prostate
cancer (PC) cell lines (i.e., PC3, LNCaP, and DU-145).
The best-performing models, based on RDKit and ECFP4 descriptors with
GBM and XGB algorithms, achieved MCC values above 0.58 and F1-score
above 0.8 across all data sets, demonstrating satisfactory accuracy
and precision. Analyses of SHAP values revealed that many misclassified
compounds possess feature values that fall within the range typically
associated with the opposite class. Based on these findings, we developed
a misclassification-detection framework using four filtering rules,
which we termed “RAW”, “SHAP”, “RAW OR SHAP”,
and “RAW AND SHAP”. These filtering rules successfully
identified several potentially misclassified predictions, with the
“RAW OR SHAP” rule retrieving up to 21%, 23%, and 63%
of misclassified compounds in the PC3, DU-145, and LNCaP test sets,
respectively. The developed flagging rules enable the systematic exclusion
of likely misclassified compounds, even across progressively higher
prediction confidence levels, thus providing a valuable approach to
improve classifier performance in virtual screening applications.
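
The description names four flagging rules but does not define them, so the following is only a minimal sketch of how such rules might be combined, not the authors' implementation. It assumes binary labels (0 = inactive, 1 = active), and it assumes that "the range typically associated with a class" can be approximated by the 25th to 75th percentile band of that class in the training set. The helper names `class_ranges`, `flag_opposite`, and `combine_rules`, as well as the `min_hits` threshold, are hypothetical. SHAP values are taken as precomputed arrays (for example from a tree explainer), so the sketch needs only NumPy.

```python
import numpy as np


def class_ranges(values, labels, low=25, high=75):
    """Per-class [low, high] percentile band for every column of `values`.

    Assumption: this band stands in for the "typical range" of a class.
    """
    return {
        int(c): (
            np.percentile(values[labels == c], low, axis=0),
            np.percentile(values[labels == c], high, axis=0),
        )
        for c in np.unique(labels)
    }


def flag_opposite(values, predicted, ranges, min_hits=1):
    """Flag rows that look like the opposite of their predicted class.

    A column counts as a hit when its value lies inside the opposite
    class's band and outside the predicted class's band. A row is
    flagged when it has at least `min_hits` hits (hypothetical threshold).
    """
    flags = np.zeros(len(values), dtype=bool)
    for i, (row, pred) in enumerate(zip(values, predicted)):
        pred = int(pred)
        opp = 1 - pred  # binary labels assumed
        in_opp = (row >= ranges[opp][0]) & (row <= ranges[opp][1])
        in_own = (row >= ranges[pred][0]) & (row <= ranges[pred][1])
        flags[i] = np.sum(in_opp & ~in_own) >= min_hits
    return flags


def combine_rules(raw_flags, shap_flags):
    """Combine the two per-compound flag vectors into the four named rules."""
    return {
        "RAW": raw_flags,
        "SHAP": shap_flags,
        "RAW OR SHAP": raw_flags | shap_flags,
        "RAW AND SHAP": raw_flags & shap_flags,
    }
```

Under these assumptions, the same `flag_opposite` function is applied once to the raw feature matrix and once to the SHAP matrix, and the resulting boolean vectors are passed to `combine_rules`. The OR rule flags the most predictions and the AND rule the fewest, so the choice between them trades the number of misclassifications caught against the number of correct predictions discarded.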
Created:
2025-10-20



