iRaPCA and SOMoC: Development and Validation of Web Applications for New Approaches for the Clustering of Small Molecules
Source: NIAID Data Ecosystem. Indexed: 2026-03-13
Download link:
https://figshare.com/articles/dataset/iRaPCA_and_SOMoC_Development_and_Validation_of_Web_Applications_for_New_Approaches_for_the_Clustering_of_Small_Molecules/20052358
Resource description:
The clustering of small molecules involves organizing a
group of chemical structures into smaller subgroups with similar features.
Clustering is useful for sampling chemical datasets
or libraries in a representative manner (e.g., to choose, from a virtual
screening hit list, a chemically diverse subset of compounds to be
submitted for experimental confirmation, or to split datasets into
representative training and validation sets when implementing machine
learning models). Most strategies for clustering molecules are based
on molecular fingerprints and hierarchical clustering algorithms.
Here, two open-source in-house methodologies for clustering of small
molecules are presented: iterative Random subspace Principal Component
Analysis clustering (iRaPCA), an iterative approach based on feature
bagging, dimensionality reduction, and K-means optimization; and Silhouette
Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints
with Uniform Manifold Approximation and Projection (UMAP) and the
Gaussian Mixture Model (GMM) algorithm. In a benchmarking exercise,
the performance of both clustering methods has been examined across
29 datasets containing between 100 and 5000 small molecules, comparing
these results with those given by two other well-known clustering
methods, Ward and Butina. iRaPCA and SOMoC consistently showed the
best performance across these 29 datasets, both in terms of within-cluster
and between-cluster distances. Both iRaPCA and SOMoC have been implemented
as free Web Apps and standalone applications, to make them available to
a wide audience within the scientific community.
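
The combination of feature bagging, dimensionality reduction, and K-means optimization described for iRaPCA can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-pass simplification that assumes scikit-learn is available, uses a synthetic matrix in place of real molecular descriptors, and selects the best run by silhouette score. The function name `random_subspace_pca_kmeans` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def random_subspace_pca_kmeans(X, n_subsets=20, subset_size=10, n_components=2,
                               k_values=range(2, 6), seed=0):
    """Return (best_score, best_labels) over random descriptor subsets and k.

    Hypothetical helper for illustration only; not the published iRaPCA code.
    """
    rng = np.random.default_rng(seed)
    best_score, best_labels = -1.0, None
    for _ in range(n_subsets):
        # Feature bagging: draw a random subset of descriptor columns.
        cols = rng.choice(X.shape[1], size=subset_size, replace=False)
        # Dimensionality reduction on the selected subspace.
        Z = PCA(n_components=n_components).fit_transform(X[:, cols])
        # K-means optimization: keep the k with the highest silhouette score.
        for k in k_values:
            labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
            score = silhouette_score(Z, labels)
            if score > best_score:
                best_score, best_labels = score, labels
    return best_score, best_labels


# Synthetic stand-in for a descriptor matrix (assumption, not real data):
# 3 groups of 40 "molecules", each described by 30 numeric descriptors.
rng = np.random.default_rng(42)
centers = rng.normal(scale=5.0, size=(3, 30))
X = np.vstack([c + rng.normal(size=(40, 30)) for c in centers])

score, labels = random_subspace_pca_kmeans(X)
```

With real molecules, `X` would be replaced by a matrix of computed molecular descriptors, and the procedure could be repeated within large clusters to make it iterative.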
Created:
2022-06-10



