Automatic Prediction of Molecular Properties Using Substructure Vector Embeddings within a Feature Selection Workflow
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/Automatic_Prediction_of_Molecular_Properties_Using_Substructure_Vector_Embeddings_within_a_Feature_Selection_Workflow/28083298
Resource description:
Machine learning (ML) methods provide a pathway to accurately predict molecular properties by leveraging patterns derived from structure–property relationships within materials databases. This approach is important in drug discovery and materials design, where rapid, efficient screening of molecules can accelerate the development of new pharmaceuticals and chemical materials for highly specialized target applications. Unsupervised and self-supervised learning methods applied to graph-based or geometric models have gained considerable traction. More recently, transformer-based language models have emerged as powerful tools. Nevertheless, their application demands considerable computational resources, owing to the need for extensive pretraining on a vast corpus of unlabeled chemical data. To address this, we present a semisupervised strategy that combines substructure vector embeddings with an ML-based feature selection workflow to predict various molecular and drug properties. We evaluate the efficacy of our modeling methodology across a diverse range of data sets, encompassing both regression and classification tasks. Our findings demonstrate superior performance compared to most existing state-of-the-art algorithms, while offering advantages in balancing model accuracy against computational requirements. Moreover, our approach provides deeper insights into the feature interactions that are essential for model interpretability. A case study on predicting the lipophilicity of chemical molecules exemplifies the robustness of our strategy. The result underscores the importance of meticulous feature analysis and selection over mere reliance on predictive modeling with a high degree of algorithmic complexity.
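The description above pairs embedding vectors with a feature selection step but does not say how that step works. As a rough illustration only, and not the authors' workflow, the sketch below applies a simple correlation-based filter to synthetic vectors that stand in for substructure embeddings. Everything in it is an assumption made for the example: the helper names (`pearson`, `select_top_k`, `make_toy_data`), the toy data, and the choice of a Pearson filter as the selection criterion.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_top_k(X, y, k):
    """Rank feature columns by |correlation with y| and keep the k best.

    Hypothetical stand-in for a feature selection step; the source does not
    specify which selection criterion is actually used.
    """
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


def make_toy_data(n_samples=200, n_features=16, informative=(2, 7, 11), seed=0):
    """Synthetic 'embedding' vectors; only the informative columns drive the target."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [sum(row[j] for j in informative) + rng.gauss(0, 0.1) for row in X]
    return X, y


X, y = make_toy_data()
selected, scores = select_top_k(X, y, k=3)
print("selected feature indices:", selected)
```

In a real setting the rows of `X` would be embedding vectors computed from molecular substructures and `y` a measured property such as lipophilicity. A univariate filter like this one ignores feature interactions, which the description names as part of what the workflow examines, so it should be read as a minimal starting point.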
Created:
2024-12-23



