Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://figshare.com/articles/dataset/Blinded_Predictions_and_Post_Hoc_Analysis_of_the_Second_Solubility_Challenge_Data_Exploring_Training_Data_and_Feature_Set_Selection_for_Machine_and_Deep_Learning_Models/22064616
Resource description:
Accurate methods to predict solubility from molecular structure
are highly sought after in the chemical sciences. To assess the state
of the art, the American Chemical Society organized a “Second
Solubility Challenge” in 2019, in which competitors were invited
to submit blinded predictions of the solubilities of 132 drug-like
molecules. In the first part of this article, we describe the development
of two models that were submitted to the Blind Challenge in 2019 but
have not previously been reported. These models were based on
computationally inexpensive molecular descriptors and traditional
machine learning algorithms and were trained on a relatively small
data set of 300 molecules. In the second part of the article, to test
the hypothesis that predictions would improve with more advanced algorithms
and higher volumes of training data, we compare these original predictions
with those made after the deadline using deep learning models trained
on larger solubility data sets consisting of 2999 and 5697 molecules.
The results show that there are several algorithms that are able to
obtain near state-of-the-art performance on the solubility challenge
data sets, with the best model, a graph convolutional neural network,
resulting in an RMSE of 0.86 log units. Critical analysis of the models
reveals systematic differences between the performance of models using
certain feature sets and training data sets. The results suggest that
careful selection of high quality training data from relevant regions
of chemical space is critical for prediction accuracy but that other
methodological issues remain problematic for machine learning solubility
models, such as the difficulty in modeling complex chemical spaces
from sparse training data sets.
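
The headline figure above is an RMSE in log units. As a minimal sketch of how such a value is computed, the snippet below assumes solubility is expressed as a base-10 logarithm (log S), which is the usual convention but is not stated in this listing. The function name `rmse` and all numeric values are hypothetical and for illustration only; they are not taken from the challenge data.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    # Square each residual, average, then take the square root.
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical log S values, invented for this example (assumption).
observed_log_s = [-2.0, -3.5, -4.0, -1.0]
predicted_log_s = [-2.5, -3.0, -5.0, -1.0]

error = rmse(predicted_log_s, observed_log_s)
print(f"RMSE = {error:.2f} log units")  # prints "RMSE = 0.61 log units"
```

Because the error is measured on a logarithmic scale, an RMSE of 1 log unit would correspond to a typical tenfold error in the solubility itself.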
Created:
2023-02-09



