Automatic Spectroscopic Data Categorization by Clustering Analysis (ASCLAN): A Data-Driven Approach for Distinguishing Discriminatory Metabolites for Phenotypic Subclasses
Source: NIAID Data Ecosystem (indexed 2026-03-09)
Download link:
https://figshare.com/articles/dataset/Automatic_Spectroscopic_Data_Categorization_by_Clustering_Analysis_ASCLAN_A_Data_Driven_Approach_for_Distinguishing_Discriminatory_Metabolites_for_Phenotypic_Subclasses/3380233
Description:
We propose a novel
data-driven approach aiming to reliably distinguish
discriminatory metabolites from nondiscriminatory metabolites for
a given spectroscopic data set containing two biological phenotypic
subclasses. The automatic spectroscopic data categorization by clustering
analysis (ASCLAN) algorithm aims to categorize spectral variables
within a data set into three clusters corresponding to noise, nondiscriminatory
and discriminatory metabolites regions. This is achieved by clustering
each spectral variable based on the r2 value representing the loading weight of each spectral variable
as extracted from an orthogonal partial least-squares discriminant analysis
(OPLS-DA) model of the data set. The variables are ranked according
to r2 values and a series of principal
component analysis (PCA) models are then built for subsets of these
spectral data corresponding to ranges of r2 values. The Q2X value
for each PCA model is extracted. K-means clustering is then applied
to the Q2X values to
generate two clusters based on a minimum Euclidean distance criterion.
The cluster consisting of lower Q2X values is deemed devoid of metabolic information (noise),
while the cluster consisting of higher Q2X values is then further subclustered into two groups
based on the r2 values. We considered the cluster with
high Q2X but low r2 values to be nondiscriminatory and the cluster
with high Q2X and high r2 values to be discriminatory. The boundaries
between these three clusters of spectral variables, expressed in terms of
the r2 values, were taken as the cutoff
values for defining the noise, nondiscriminatory, and discriminatory
variables. We evaluated the ASCLAN algorithm using six simulated 1H NMR spectroscopic data sets representing small, medium and
large data sets (N = 50, 500, and 1000 samples per
group, respectively), each with a reduced and full resolution set
of variables (0.005 and 0.0005 ppm, respectively). ASCLAN correctly
identified all discriminatory metabolites and produced zero false positives
(100% specificity and positive predictive value) irrespective of the
spectral resolution or the sample size in all six simulated data sets.
This false positive rate was lower than those of existing methods for ascertaining
feature significance: univariate t tests with Bonferroni
correction (up to 10% false positive rate), Benjamini–Hochberg
correction (up to 35% false positive rate), and the metabolome-wide significance
level (MWSL, up to 0.4% false positive rate), as well as various
OPLS-DA parameters: variable importance to projection (up to 15%
false positive rate), loading coefficients (up to 35% false positive
rate), and regression coefficients (up to 39% false positive rate).
The application of ASCLAN was further exemplified using a widely investigated
renal toxin, mercury(II) chloride (HgCl2), in a rat model.
ASCLAN successfully identified many of the known metabolic changes related
to renal toxicity, such as increased urinary excretion of creatinine
and various amino acids. The ASCLAN algorithm provides a framework
for reliably differentiating discriminatory metabolites from nondiscriminatory
metabolites in a biological data set without the need to set an arbitrary
cutoff value, as some of the conventional methods require. This
offers significant advantages over existing methods and the possibility
for automation of high-throughput screening in “omics”
data.
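
The categorization steps described above can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation, and it rests on several assumptions that are not in the description: r2 is approximated by the squared Pearson correlation of each variable with the class label (standing in for the OPLS-DA loading weight), the "ranges of r2 values" are taken to be consecutive equal-sized windows of the ranked variables, Q2X is estimated with a one-component PCA and a simple leave-one-variable-out cross-validation, and all function names and default parameters are hypothetical.

```python
import numpy as np


def kmeans_1d_two(values, n_iter=100):
    """Two-cluster K-means on scalars (Lloyd's algorithm, Euclidean distance).
    Returns a boolean mask that is True for the higher-valued cluster."""
    v = np.asarray(values, dtype=float)
    lo, hi = v.min(), v.max()                 # initial centres: extremes
    high = np.abs(v - hi) < np.abs(v - lo)
    for _ in range(n_iter):
        if high.all() or not high.any():
            break
        lo, hi = v[~high].mean(), v[high].mean()
        new_high = np.abs(v - hi) < np.abs(v - lo)
        if np.array_equal(new_high, high):
            break
        high = new_high
    return high


def r2_with_class(X, y):
    """ASSUMPTION: squared Pearson correlation with the class label is used
    here as a stand-in for the OPLS-DA r2 of each spectral variable."""
    Xc = X - X.mean(axis=0)
    yc = np.asarray(y, dtype=float) - np.mean(y)
    num = (Xc * yc[:, None]).sum(axis=0) ** 2
    den = (Xc ** 2).sum(axis=0) * (yc ** 2).sum()
    return num / np.where(den > 0, den, 1.0)


def q2x_one_component(X, n_folds=7):
    """ASSUMPTION: Q2X of a one-component PCA, estimated by row-wise folds
    in which each held-out value is predicted from the other variables."""
    folds = np.arange(X.shape[0]) % n_folds
    press = ss = 0.0
    for f in range(n_folds):
        train, test = X[folds != f], X[folds == f]
        mu = train.mean(axis=0)
        p = np.linalg.svd(train - mu, full_matrices=False)[2][0]  # loading
        E = test - mu
        s = E @ p
        # score for variable j computed without variable j itself
        t = (s[:, None] - E * p) / np.maximum(1.0 - p ** 2, 1e-12)
        press += ((E - t * p) ** 2).sum()
        ss += (E ** 2).sum()
    return 1.0 - press / ss if ss > 0 else 0.0


def asclan_sketch(X, y, window=20):
    """Label each variable 0 (noise), 1 (nondiscriminatory) or
    2 (discriminatory). `window` is a hypothetical tuning parameter."""
    r2 = r2_with_class(X, y)
    order = np.argsort(r2)                    # rank variables by r2
    chunks = [order[i:i + window] for i in range(0, len(order), window)]
    q2x = np.array([q2x_one_component(X[:, c]) for c in chunks])
    r2_mean = np.array([r2[c].mean() for c in chunks])

    window_label = np.zeros(len(chunks), dtype=int)   # low Q2X -> noise
    informative = np.flatnonzero(kmeans_1d_two(q2x))  # high-Q2X windows
    if informative.size:
        disc = kmeans_1d_two(r2_mean[informative])    # subcluster on r2
        window_label[informative] = np.where(disc, 2, 1)

    labels = np.empty(X.shape[1], dtype=int)
    for c, lab in zip(chunks, window_label):
        labels[c] = lab
    # r2 cutoffs: smallest r2 observed in each informative class
    cutoffs = {k: float(r2[labels == k].min())
               for k in (1, 2) if (labels == k).any()}
    return labels, r2, q2x, cutoffs
```

Here `X` is a samples-by-variables array and `y` holds the two class labels coded as 0 and 1. Fixed-size windows keep the sketch short; a faithful reimplementation would need the window definition, scaling, and cross-validation scheme used in the original work.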
Created:
2016-06-27



