DataSheet1_Dimensionality Reduction and Louvain Agglomerative Hierarchical Clustering for Cluster-Specified Frequent Biomarker Discovery in Single-Cell Sequencing Data.CSV

Name: DataSheet1_Dimensionality Reduction and Louvain Agglomerative Hierarchical Clustering for Cluster-Specified Frequent Biomarker Discovery in Single-Cell Sequencing Data.CSV
Creator: frontiersin.figshare.com
Published: 2023-06-04 00:00:00
License: No description available

frontiersin.figshare.com | Updated: 2023-06-04 | Indexed: 2025-03-23

Download link:

https://frontiersin.figshare.com/articles/dataset/DataSheet1_Dimensionality_Reduction_and_Louvain_Agglomerative_Hierarchical_Clustering_for_Cluster-Specified_Frequent_Biomarker_Discovery_in_Single-Cell_Sequencing_Data_CSV/19129277/1

Resource description:

The main applications of single-cell RNA sequencing analysis are the identification of known and novel cell types, the characterization of cells, cell fate prediction, the classification of tumor types, and the investigation of cellular heterogeneity. Single-cell clustering plays an important role in addressing these questions. Cluster identification in high-dimensional single-cell sequencing data is challenging because of the nature of the data, and dimensionality reduction models can mitigate the problem. Here, we introduce a framework for discovering potential cluster-specific frequent biomarkers that combines dimensionality reduction with Louvain agglomerative hierarchical clustering for single-cell RNA sequencing data analysis. First, we pre-filtered features detected in too few cells and cells with too few detected features. We then created a Seurat object to store the data and analysis together and used quality control metrics to discard low-quality or dying cells. Afterwards, we applied the global-scaling normalization method “LogNormalize”. Next, we identified the features that vary most from cell to cell in our dataset. Then, we applied a linear transformation followed by a linear dimensionality reduction technique, Principal Component Analysis (PCA), to project the high-dimensional data onto an optimal low-dimensional space. After identifying fifty “significant” principal components (PCs) based on strong enrichment of low p-value features, we applied the graph-based clustering algorithm Louvain to the top 10 significant PCs for cell clustering. We applied our model to a single-cell RNA sequencing dataset for a rare intestinal cell type in mice (NCBI accession ID: GSE62270; 23,630 features and 1,872 samples (cells)). We obtained 10 cell clusters with a maximum modularity of 0.8851.
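
Two of the quantities named above can be illustrated on toy data. The following is a minimal sketch, not the authors' code (the study itself used Seurat in R). It assumes that “LogNormalize” divides each count by the cell's total count, multiplies by a scale factor (10,000 is assumed here), and applies log1p, and that modularity is the standard Newman modularity of an unweighted, undirected graph without self-loops. The function names `log_normalize` and `modularity` are hypothetical, and the sketch only evaluates the modularity of a given partition; it does not perform the Louvain optimization.

```python
import math
from collections import defaultdict


def log_normalize(counts, scale_factor=10_000):
    """Global-scaling log normalization for a single cell.

    counts: raw feature counts for one cell.
    Returns log1p(count / total * scale_factor) for each feature.
    The scale factor of 10,000 is an assumption, not a value from the abstract.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]


def modularity(edges, membership):
    """Newman modularity Q of a partition (unweighted, undirected, no self-loops).

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping node -> cluster label.
    Q = sum over clusters of (intra_edges / m) - (cluster_degree / (2 * m)) ** 2
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # edges with both endpoints in the cluster
    degree = defaultdict(int)  # summed node degree per cluster
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            intra[membership[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy cell with three features (illustrative counts only).
normalized = log_normalize([1, 0, 3])

# Toy graph: two triangles joined by a single edge, split into two clusters.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(toy_edges, toy_membership)
```

A graph-based method such as Louvain searches for the partition that maximizes this quantity; in the study the graph is built from cells in PC space rather than from a hand-written edge list.
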
After detecting the cell clusters, we found 3,871 cluster-specific biomarkers using Model-based Analysis of Single-cell Transcriptomics (MAST), a statistical tool for expression feature extraction in single-cell sequencing data, with a log2FC threshold of 0.25 and a minimum feature detection of 25%. Among these cluster-specific biomarkers, we found 1,892 most frequent markers, i.e., overlapping biomarkers. We performed degree-based hub gene network analysis using Cytoscape and reported the five highest-degree genes (Rps4x, Rps18, Rpl13a, Rps12, and Rpl18a). Subsequently, we performed KEGG pathway and Gene Ontology enrichment analysis of the cluster markers using the DAVID 6.8 software tool. In summary, our proposed framework, which integrates dimensionality reduction and agglomerative hierarchical clustering, provides a robust approach for efficiently discovering cluster-specific frequent biomarkers, i.e., overlapping biomarkers, from single-cell RNA sequencing data.
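
The marker thresholds and the notion of overlapping biomarkers can be sketched on toy data. This is a hedged illustration under stated assumptions, not the authors' procedure. It assumes the log2FC threshold is applied to the absolute fold change, that the 25% detection threshold must be met in at least one of the two compared cell groups, and that an "overlapping" biomarker is a gene reported as a marker in two or more clusters. It omits the MAST statistical test itself. The function names and the gene names (`GeneA` and so on) are hypothetical.

```python
from collections import Counter


def filter_markers(stats, min_log2fc=0.25, min_pct=0.25):
    """Keep genes that pass both pre-filter thresholds.

    stats: dict mapping gene -> (log2fc, pct_in_cluster, pct_elsewhere).
    A gene passes if |log2fc| >= min_log2fc and it is detected in at
    least min_pct of cells in either group (assumed interpretation).
    """
    return {
        gene
        for gene, (log2fc, pct_in, pct_out) in stats.items()
        if abs(log2fc) >= min_log2fc and max(pct_in, pct_out) >= min_pct
    }


def overlapping_markers(markers_by_cluster, min_clusters=2):
    """Return genes that are markers in at least min_clusters clusters.

    markers_by_cluster: dict mapping cluster label -> iterable of genes.
    The min_clusters=2 cutoff is an assumption about what "overlapping" means.
    """
    counts = Counter(
        gene for genes in markers_by_cluster.values() for gene in set(genes)
    )
    return {gene for gene, n in counts.items() if n >= min_clusters}


# Toy per-gene statistics for one cluster (made-up numbers).
toy_stats = {
    "GeneA": (0.8, 0.60, 0.10),   # passes both thresholds
    "GeneB": (0.1, 0.90, 0.90),   # fold change too small
    "GeneC": (-0.5, 0.10, 0.40),  # passes: |log2FC| and detection elsewhere
    "GeneD": (1.2, 0.05, 0.02),   # detected in too few cells
}
kept = filter_markers(toy_stats)

# Toy marker sets for three clusters.
toy_clusters = {
    "c0": {"GeneA", "GeneC"},
    "c1": {"GeneA"},
    "c2": {"GeneC", "GeneE"},
}
shared = overlapping_markers(toy_clusters)
```

Counting cluster membership per gene keeps the overlap definition explicit, so a stricter cutoff (for example, markers shared by three or more clusters) only requires changing `min_clusters`.
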

Provider:

frontiersin.figshare.com