Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores
Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2025-05-10.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.zgmsbccg7
Description:
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on
the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are
provided in the xyz format containing the AutoDock Vina docking scores.
These files were used as input and/or reference in the machine learning
models using TensorFlow, XGBoost, and SchNetPack to study their ability to
predict docking scores. The first data set originally contained
60,411 in-vivo labeled compounds selected for the training of ML models.
The second data set, denoted as in-vitro-only, originally contained 175,696
compounds active or assumed to be active at 10 μM or less in a direct
binding assay. These sets were downloaded on the 10th of December 2021
from the ZINC15 database. Four compounds in the in-vivo set and 12 in the
in-vitro-only set were excluded due to the presence of Si
atoms. Compounds with no charges assigned in mol2 files were excluded as
well (523 compounds in the in-vivo and 1,666 in the in-vitro-only set).
Gasteiger charges were reassigned to the remaining compounds using
OpenBabel. In addition, four in-vitro-only compounds with docking scores
greater than 1 kcal/mol were rejected. The provided in-vivo and
in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014
(in-vitro-only.xyz) compounds, respectively. Compounds in both sets
contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The
in-vivo compound set was used as the primary data set for the training of
the ML models in the referencing study. The file in-vivo-splits-data.csv
contains the exact composition of all (random) 80-5-15
train-validation-test splits used in the study, labeled I, II, III, IV,
and V. Eight additional random subsets in each of the in-vivo 80-5-15
splits were created to monitor the convergence of the training process.
These subsets were constructed such that each subset contains all
compounds from the previous subset (starting with the 10-5-15 subset) and
is enlarged by one eighth of the entire (80-5-15) train set of a given
split. These subsets are referred to as in_vivo_10_(I, II, ..., V),
in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).
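
The nested construction of the training subsets described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the shuffling method, the seed, the function name nested_train_subsets, and the compound identifiers are all assumptions made for illustration. The sketch only shows the stated property, namely that each subset contains the previous one and grows by one eighth of the full train set; the validation and test portions of a split are left out of the sketch.

```python
import random


def nested_train_subsets(train_ids, n_subsets=8, seed=0):
    """Build nested subsets of one train set (hypothetical helper).

    Subset k holds the first k/n_subsets of a shuffled copy of the train
    set, so every subset contains all compounds of the previous one.
    """
    rng = random.Random(seed)      # fixed seed: an assumption, for repeatability
    shuffled = list(train_ids)
    rng.shuffle(shuffled)
    n = len(shuffled)
    subsets = {}
    for k in range(1, n_subsets + 1):
        cutoff = (k * n) // n_subsets          # k eighths of the train set
        subsets[f"in_vivo_{10 * k}"] = shuffled[:cutoff]
    return subsets


# Toy train set of 80 made-up compound identifiers.
train_ids = [f"cmpd_{i}" for i in range(80)]
subsets = nested_train_subsets(train_ids)
sizes = [len(subsets[f"in_vivo_{10 * k}"]) for k in range(1, 9)]
print(sizes)
```

With the toy input of 80 identifiers, each subset is ten compounds larger than the previous one, and the last subset equals the full train set.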
Provider:
Dryad
Created:
2023-03-03



