Automated Identification of Chemical Series: Classifying like a Medicinal Chemist
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://figshare.com/articles/dataset/Automated_Identification_of_Chemical_Series_Classifying_like_a_Medicinal_Chemist/12424430
Description:
We investigate different automated approaches for the classification
of chemical series in early drug discovery, with the aim of closely
mimicking human chemical series conception. Chemical series, which
are commonly defined by hand-drawn scaffolds, organize datasets in
drug discovery projects. Often, they form the basis for further project
decisions. To trace and evaluate these decisions in historic and ongoing
projects, it is important to know or reconstruct chemical series.
There is no single correct definition of a chemical series, and
human definitions inevitably involve subjective bias. Hence, we first
develop quality metrics for the chemical series definitions, evaluating
the size and specificity of chemical series. These metrics are applied
to categorize human series definitions and implemented in automated
classification approaches. For the automated classification of chemical
series, we test different fragmentation and similarity-based clustering
algorithms and apply different approaches to infer series definitions
from these clusters or sets of fragments. We benchmark the classification
results against human-defined series from 30 internal projects. The
best results in reproducing the composition of human-defined series
are achieved when applying UPGMA (unweighted pair group method with
arithmetic mean) clustering to the project dataset and calculating
maximum common substructures of the clusters as series definitions.
We evaluate this approach in more detail on a public dataset and assess
its robustness by 10-fold cross-validation, each time sampling 40%
of the dataset. Through these benchmarking and validation experiments,
we show that the proposed automated approach is able to accurately
and robustly identify human-defined series that comply with a
predefined level of specificity and size. Suggesting a thoroughly
tested algorithm for series classification, as well as quality metrics
for series and several benchmarking approaches, this work lays the
foundation for further analysis of project decisions, and it offers
an enhanced understanding of the properties of human-defined chemical
series.
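
The best-performing pipeline described above (UPGMA clustering followed by a per-cluster series definition) can be illustrated with a minimal, pure-Python sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: compounds are represented as toy fingerprint sets of integer feature IDs, distance is 1 minus Tanimoto similarity, the `cutoff` value is arbitrary, and the intersection of feature sets (`shared_features`, a hypothetical helper) merely stands in for a true maximum common substructure, which would require a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """1 - Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)


def upgma(fingerprints, cutoff):
    """Average-linkage (UPGMA) agglomerative clustering.

    Repeatedly merges the two clusters with the smallest mean pairwise
    distance and stops once that distance exceeds `cutoff`.
    Returns a list of clusters, each a sorted list of compound indices.
    """
    clusters = [[i] for i in range(len(fingerprints))]

    def mean_distance(c1, c2):
        # Unweighted arithmetic mean over all cross-cluster member pairs.
        total = sum(
            tanimoto_distance(fingerprints[i], fingerprints[j])
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), dist = min(
            (
                ((p, q), mean_distance(clusters[p], clusters[q]))
                for p, q in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > cutoff:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


def shared_features(fingerprints, cluster):
    """Stand-in for an MCS: features present in every cluster member."""
    return set.intersection(*(fingerprints[i] for i in cluster))


# Toy data (hypothetical): two groups of compounds with distinct cores.
toy_fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 6},
    {7, 8, 9, 10},
    {7, 8, 9, 11},
]
series = upgma(toy_fingerprints, cutoff=0.5)
definitions = [shared_features(toy_fingerprints, c) for c in series]
```

On the toy data, compounds sharing a core fall into the same cluster, and each cluster's shared feature set plays the role of the series definition. In a real setting, the fingerprints and the substructure search would come from a cheminformatics library, and the cutoff would be tuned against the size and specificity metrics mentioned in the description.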
Created:
2020-05-06



