Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach
Source: NIAID Data Ecosystem (indexed 2026-03-10)
Download link:
https://figshare.com/articles/dataset/Chemical_Topic_Modeling_Exploring_Molecular_Data_Sets_Using_a_Common_Text-Mining_Approach/5294899
Description:
Big
data is one of the key transformative factors which increasingly influences
all aspects of modern life. Although this transformation brings vast
opportunities, it also generates novel challenges, not the least of
which is organizing and searching this data deluge. The field of medicinal
chemistry is no different: more and more data are being generated,
for instance, by technologies such as DNA encoded libraries, peptide
libraries, text mining of large literature corpora, and new in silico
enumeration methods. Handling those huge sets of molecules effectively
is quite challenging and requires compromises that often come at the
expense of the interpretability of the results. In order to find an
intuitive and meaningful approach to organizing large molecular data
sets, we adopted a probabilistic framework called “topic modeling”
from the text-mining field. Here we present the first chemistry-related
implementation of this method, which allows large molecule sets to
be assigned to “chemical topics” and the relationships between those
topics to be investigated. In this first study, we thoroughly evaluate
this novel method in different experiments and discuss both its disadvantages
and advantages. We show very promising results in reproducing human-assigned
concepts using the approach to identify and retrieve chemical series
from sets of molecules. We have also created an intuitive visualization
of the chemical topics output by the algorithm. This is a huge benefit
compared to other unsupervised machine-learning methods, like clustering,
which are commonly used to group sets of molecules. Finally, we applied
the new method to the 1.6 million molecules of the ChEMBL22 data set
to test its robustness and efficiency. In about 1 h we built a 100-topic
model of this large data set in which we could identify interesting
topics like “proteins”, “DNA”, or “steroids”.
Along with this publication we provide our data sets and an open-source
implementation of the new method (CheTo) which will be part of an
upcoming version of the open-source cheminformatics toolkit RDKit.
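
The core idea above, treating each molecule as a "document" and its substructure fragments as "words", can be illustrated with a small sketch. This is not the CheTo implementation: it assumes latent Dirichlet allocation (LDA) as the topic model, which the description does not specify, and fits it with a minimal collapsed Gibbs sampler in pure Python. The fragment labels, the `MOLECULES` list, and the `fit_lda` function are hypothetical and invented for illustration only.

```python
import random
from collections import Counter

# Hypothetical toy data: each "document" is a molecule and each "word" is a
# substructure fragment label. The labels are invented for illustration.
MOLECULES = [
    ["steroid_core", "hydroxyl", "ketone", "steroid_core"],
    ["steroid_core", "ketone", "methyl", "hydroxyl"],
    ["purine", "ribose", "phosphate", "purine"],
    ["pyrimidine", "ribose", "phosphate", "phosphate"],
    ["steroid_core", "hydroxyl", "methyl"],
    ["purine", "pyrimidine", "ribose", "phosphate"],
]


def fit_lda(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling on lists of fragment labels."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # fragments per (molecule, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # fragment counts per topic
    topic_total = [0] * n_topics                        # total fragments per topic
    assignments = []

    # Assign a random initial topic to every fragment occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic probabilities for each molecule (each row sums to 1).
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Up to three most frequent fragments assigned to each topic.
    top_fragments = [
        [w for w, c in tw.most_common() if c > 0][:3] for tw in topic_word
    ]
    return theta, top_fragments


if __name__ == "__main__":
    theta, top_fragments = fit_lda(MOLECULES)
    for k, frags in enumerate(top_fragments):
        print(f"topic {k}: {frags}")
    for d, probs in enumerate(theta):
        print(f"molecule {d}: {[round(p, 2) for p in probs]}")
```

Each molecule receives a probability for every topic rather than a single hard cluster label, and each topic can be summarized by its most frequent fragments. In a real workflow the fragment labels would come from a cheminformatics toolkit rather than a hand-written list.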
Created:
2017-08-09



