A comprehensive benchmark of single-cell Hi-C embedding tools

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE305523

Description:

Embedding is the key step in single-cell Hi-C (scHi-C) analysis, and it relies on capturing biologically meaningful heterogeneity at various levels of genome architecture. To understand the strengths and limitations of existing tools in various applications, we use ten scHi-C datasets to benchmark thirteen embedding tools, including Va3DE, a new convolutional neural network model that can accommodate large cell numbers. We built a software framework to decouple the preprocessing options of existing tools and found that no single tool works best across all datasets under default settings. The benchmark datasets differ in difficulty and in preferred resolution, and the choice of data representation and preprocessing strongly affects embedding performance. Embedding cells from early embryonic stages relies on long-range, compartment-scale contacts, whereas resolving cell cycle phases and complex tissues requires short-range, loop-scale contacts. Both random-walk and inverse document frequency (IDF) transformations favor long-range “compartment-scale” embedding over short-range “loop-scale” embedding, while deep-learning methods better overcome sparsity at both scales and are more versatile across resolutions. Finally, “diagonal integration” with an independent data modality is a promising approach for distinguishing similar cell subpopulations. Our findings underscore the importance of appropriate priors for scHi-C embedding and offer new insights into genome architecture heterogeneity.

The collection comprises single-cell Hi-C datasets spanning early embryogenesis, cell cycle, and differentiated brain cells in both mouse and human.

Data processing: Contact maps were generated by binning the pairs files to ~5-10 kb resolution, depending on the restriction enzyme, and summing neighboring bins to produce lower-resolution maps down to 1 Mb. Raw per-bin count values are provided without normalization.
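The bin-summing step in the data processing note can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `coarsen_contact_map`, the dense NumPy input, and the zero-padding of a trailing partial bin are all assumptions made for illustration.

```python
import numpy as np


def coarsen_contact_map(counts: np.ndarray, factor: int) -> np.ndarray:
    """Sum factor-by-factor blocks of a square raw-count contact map.

    Hypothetical helper: `counts` is assumed to be a dense (n, n) matrix of
    raw contact counts for one chromosome of one cell, and `factor` is the
    number of neighboring fine bins merged into one coarse bin
    (e.g. 10 kb -> 100 kb uses factor=10).
    """
    if counts.ndim != 2 or counts.shape[0] != counts.shape[1]:
        raise ValueError("counts must be a square 2-D matrix")
    if factor < 1:
        raise ValueError("factor must be a positive integer")

    n = counts.shape[0]
    n_coarse = -(-n // factor)        # ceiling division
    pad = n_coarse * factor - n       # zero-pad so the last coarse bin is complete
    padded = np.pad(counts, ((0, pad), (0, pad)))

    # Axes after reshape: (coarse row, row within block, coarse col, col within block).
    blocks = padded.reshape(n_coarse, factor, n_coarse, factor)
    return blocks.sum(axis=(1, 3))


# Usage: merge pairs of neighboring bins in a toy 4x4 map.
fine = np.arange(16).reshape(4, 4)
coarse = coarsen_contact_map(fine, 2)
print(coarse.shape, coarse.sum() == fine.sum())
```

Because the sketch only sums raw counts, the total number of contacts is preserved at every resolution, which is consistent with the maps being provided without normalization. It sums every entry of the dense matrix it is given, so a symmetric input keeps each off-diagonal contact represented twice, exactly as in the input.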
Processed data: tab-delimited file containing cell names and metadata

Processed data: tarball containing scool files at multiple resolutions

The table below lists GEO accessions reused/reanalyzed for this study.

Created:

2025-08-19
