PySIDT: Subgraph Isomorphic Decision Trees for Molecular Property Prediction
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/PySIDT_Subgraph_Isomorphic_Decision_Trees_for_Molecular_Property_Prediction/30423174
Description:
Accurate molecular property prediction is important across all
fields of chemistry. Deep neural networks (DNNs) have become increasingly
popular due to their ability to train automatically, avoiding the
incredibly tedious process of constructing and extending traditional
property estimation schemes. However, DNNs require large amounts of
training data, are challenging to interpret, require large amounts
of memory to load even during inference, and have severe difficulties
incorporating qualitative chemical knowledge, which are often desired
for molecular property prediction tasks. Here we present PySIDT (https://github.com/zadorlab/PySIDT), a software package for training and running inference on Subgraph Isomorphic
Decision Trees (SIDTs). SIDTs are graph-based decision trees made
of nodes associated with molecular substructures. Inference is done
by descending target molecular structures down the decision tree to
nodes with matching subgraph isomorphic substructures and making predictions
based on the final (most specific) nodes matched. SIDTs scale down
well to dataset sizes much smaller than is feasible for DNNs. As trees
of molecular substructures, SIDTs are inherently readable and easy
to visualize, making them easy to analyze. They are also straightforward
to extend and retrain, facilitate uncertainty estimation, and enable
easy integration of expert knowledge. We demonstrate the SIDT approach
by discussing its application to a diverse range of molecular prediction
tasks: rate coefficient estimation, diffusion coefficient estimation,
thermochemistry estimation, transition state bond stretch prediction,
pKa prediction, stability of molecular
structures, stability of surface structures, and prediction of surface
lateral interaction energetics. Additionally, we demonstrate the power
of the SIDT algorithms in two direct, vanilla learning-curve comparisons
with the popular DNN-based software Chemprop and the popular
gradient-boosted-tree-based software XGBoost on enthalpy of formation and
rate coefficient prediction tasks. In particular, in the enthalpy
of formation case, vanilla PySIDT is able to outperform vanilla Chemprop
and XGBoost across the full range of training/validation set sizes
out to 11,560 data points.
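
The inference procedure described above (descending a target structure through the tree to the most specific matching node) can be illustrated with a minimal, standard-library sketch. This is not PySIDT's API: the `Graph` and `Node` classes, the `matches` and `predict` functions, the brute-force subgraph matcher, the rule of taking the first matching child, and all node values are hypothetical placeholders chosen for illustration only.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class Graph:
    """Toy heavy-atom graph (hypothetical, not a PySIDT type)."""
    atoms: list   # element symbol for each atom index
    bonds: set    # set of frozenset({i, j}) index pairs


def graph(atoms, bonds):
    """Convenience constructor for the toy Graph."""
    return Graph(list(atoms), {frozenset(b) for b in bonds})


def matches(pattern, mol):
    """Brute-force test: does `pattern` occur as a subgraph of `mol`?

    Tries every injective mapping of pattern atoms onto molecule atoms.
    Only suitable for tiny graphs; real tools use far faster matchers.
    """
    n = len(pattern.atoms)
    for mapping in permutations(range(len(mol.atoms)), n):
        if any(pattern.atoms[i] != mol.atoms[mapping[i]] for i in range(n)):
            continue  # element labels disagree
        if all(frozenset(mapping[i] for i in bond) in mol.bonds
               for bond in pattern.bonds):
            return True  # every pattern bond is present in the molecule
    return False


@dataclass
class Node:
    """Tree node: a substructure plus a prediction (placeholder value)."""
    name: str
    pattern: Graph
    value: float
    children: list = field(default_factory=list)


def predict(root, mol):
    """Descend from the root, moving to the first child whose substructure
    matches, and return the last (most specific) node reached."""
    node = root
    while True:
        nxt = next((c for c in node.children if matches(c.pattern, mol)), None)
        if nxt is None:
            return node.name, node.value
        node = nxt


# Hypothetical tree: the root matches everything, children are more specific.
ether = Node("C-O-C", graph("COC", [(0, 1), (1, 2)]), 2.0)
c_o = Node("C-O", graph("CO", [(0, 1)]), 1.0, [ether])
c_n = Node("C-N", graph("CN", [(0, 1)]), 3.0)
tree = Node("root", graph("", []), 0.0, [c_o, c_n])

# Heavy-atom skeletons of three small molecules.
ethanol = graph("CCO", [(0, 1), (1, 2)])
dimethyl_ether = graph("COC", [(0, 1), (1, 2)])
ethane = graph("CC", [(0, 1)])

for label, mol in [("ethanol", ethanol),
                   ("dimethyl ether", dimethyl_ether),
                   ("ethane", ethane)]:
    print(label, predict(tree, mol))
```

In this toy tree, ethanol stops at the `C-O` node, dimethyl ether descends one level further to `C-O-C`, and ethane matches no child and falls back to the root. Because each prediction is tied to a named substructure, the path taken through the tree can be read directly, which reflects the readability the description attributes to SIDTs.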
Created:
2025-10-22



