BioSM: Metabolomics Tool for Identifying Endogenous Mammalian Biochemical Structures in Chemical Structure Space
Source: NIAID Data Ecosystem (indexed 2026-03-09)
Download link:
https://figshare.com/articles/dataset/BioSM_Metabolomics_Tool_for_Identifying_Endogenous_Mammalian_Biochemical_Structures_in_Chemical_Structure_Space/2431600
Description:
The structural identification of unknown biochemical compounds
in complex biofluids continues to be a major challenge in metabolomics
research. Using LC/MS, there are currently two major options for solving
this problem: searching small biochemical databases, which often do
not contain the unknown of interest, or searching large chemical databases,
which include large numbers of nonbiochemical compounds. Searching
larger chemical databases (larger chemical space) increases the odds
of identifying an unknown biochemical compound, but only if nonbiochemical
structures can be eliminated from consideration. In this paper, we
present BioSM, a cheminformatics tool that uses known endogenous mammalian
biochemical compounds (as scaffolds) and graph matching methods to
identify endogenous mammalian biochemical structures in chemical structure
space. The results of a comprehensive set of empirical experiments
suggest that BioSM identifies endogenous mammalian biochemical structures
with high accuracy. In a leave-one-out cross validation experiment,
BioSM correctly predicted 95% of 1388 Kyoto Encyclopedia of Genes
and Genomes (KEGG) compounds as endogenous mammalian biochemicals
using 1565 scaffolds. Analysis of two additional biological data sets
containing 2330 human metabolites (HMDB) and 2416 plant secondary
metabolites (KEGG) resulted in biochemical annotations of 89% and
72% of the compounds, respectively. When a data set of 3895 drugs
(DrugBank and USAN) was tested, 48% of these structures were predicted
to be biochemical. However, when a set of synthetic chemical compounds
(Chembridge and Chemsynthesis databases) was examined, only 29% of
the 458 207 structures were predicted to be biochemical. Moreover,
BioSM predicted that 34% of 883 199 randomly selected compounds
from PubChem were biochemical. We then expanded the scaffold list
to 3927 biochemical compounds and reevaluated the above data sets
to determine whether scaffold number influenced model performance.
Although there were significant improvements in model sensitivity
and specificity using the larger scaffold list, the data set comparison
results were very similar. These results suggest that additional biochemical
scaffolds will not further improve our representation of biochemical
structure space and that the model is reasonably robust. BioSM provides
a qualitative (yes/no) and quantitative (ranking) method for endogenous
mammalian biochemical annotation of chemical space and, thus, will
be useful in the identification of unknown biochemical structures
in metabolomics. BioSM is freely available at http://metabolomics.pharm.uconn.edu.
Created:
2016-02-19



