SiatBioInf/SingleCell-Unseen-Benchmark

Name: SiatBioInf/SingleCell-Unseen-Benchmark
Creator: SiatBioInf
Published: 2026-01-13 01:58:52
License: not stated on this listing (the dataset card metadata declares cc-by-4.0)

Hugging Face · Updated 2026-01-13 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/SiatBioInf/SingleCell-Unseen-Benchmark

Description:

---
license: cc-by-4.0
language:
- en
task_categories:
- other
tags:
- biology
- single-cell
- scRNA-seq
- h5ad
- cancer
- tumor
- benchmark
- bioinformatics
size_categories:
- 1M<n<10M
---

# SingleCell-Unseen-Benchmark

## Overview

**SingleCell-Unseen-Benchmark** is a large-scale unseen single-cell transcriptomic benchmark designed to systematically evaluate foundation models on cell identification and cell type tracing tasks.

The benchmark covers **tumor, stem, neural, and normal cell populations**, with a particular emphasis on **unseen data distributions**, including rare cell types, cross-dataset generalization, and heterogeneous tumor states.

In addition to curated datasets, this repository provides **standardized benchmark results** for multiple single-cell foundation models, enabling transparent and reproducible comparison.

---

## Dataset Collection

### Tumor Cells

- **Source**: GEO
- **Cancer types**: 21
- **Samples**: 2,225
- **Cells**: 1,645,662
- **Cell states**: Primary tumors, metastases, circulating tumor cells (CTCs)

### Stem Cells

- **Source**: CELLxGENE
- **Datasets**: 5
- **Cells**: 325,092
- **Stem cell types**: 4

### Neural Cells

- **Source**: CELLxGENE
- **Datasets**: 1
- **Cells**: 423,707
- **Neural cell types**: 6

### Normal Cells

- **Source**: CELLxGENE
- **Datasets**: 7
- **Cells**: 1,838,991
- **Normal cell types**: 10

### Preprocessing

- All genes were mapped to **HGNC symbols**
- Cells with fewer than **200 detected genes** were removed
- Expression matrices are stored in **AnnData (`.h5ad`) format**

---

## Cell Type and Malignancy Annotation Strategy

Tumor cells derived from GEO were re-identified using a **consensus workflow**:

1. **Lineage-level screening** based on **CancerSCEM 2.0** marker genes
2. **Malignancy confirmation** using **inferCNV**

CELLxGENE-derived datasets retain their **original annotations**.

This strategy ensures consistent tumor labeling while minimizing dataset-specific bias.

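
The cell filter listed under Preprocessing (cells with fewer than 200 detected genes are removed) can be sketched in plain Python as follows. This is a minimal illustration and not the authors' pipeline: the function names are hypothetical, a gene is assumed to count as "detected" when its count is greater than zero, and the matrix is a toy list of rows rather than an AnnData object. On real `.h5ad` data, Scanpy's `sc.pp.filter_cells(adata, min_genes=200)` applies the same kind of threshold.

```python
from typing import List, Sequence, Tuple

# Threshold stated in the Preprocessing section of the dataset card.
MIN_DETECTED_GENES = 200


def count_detected_genes(cell_counts: Sequence[float]) -> int:
    """Count genes with a non-zero value in one cell (assumed meaning of 'detected')."""
    return sum(1 for value in cell_counts if value > 0)


def filter_cells(
    matrix: Sequence[Sequence[float]],
    min_genes: int = MIN_DETECTED_GENES,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep cells (rows) with at least `min_genes` detected genes.

    Returns the kept rows and their original row indices, so that any
    per-cell annotations can be subset in the same way.
    """
    kept_indices = [
        i for i, row in enumerate(matrix)
        if count_detected_genes(row) >= min_genes
    ]
    return [matrix[i] for i in kept_indices], kept_indices


# Toy cells-by-genes matrix with 300 genes per cell.
toy_matrix = [
    [1] * 250 + [0] * 50,   # 250 detected genes: kept
    [3] * 199 + [0] * 101,  # 199 detected genes: removed
    [2] * 200 + [0] * 100,  # exactly 200 detected genes: kept
]
kept_rows, kept_idx = filter_cells(toy_matrix)
print(kept_idx)
```

Returning the kept indices alongside the rows is a design choice for the sketch: it keeps cell-level labels (for example malignancy annotations) aligned with the filtered matrix.
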

---

## Downstream Benchmark Tasks

The benchmark evaluates foundation models across multiple biologically meaningful tasks:

| Category | Task | Prediction Type |
|----------|------|-----------------|
| Tumor | Tumor cell identification | Binary |
| Tumor | Primary site tracing | Multi-class |
| Stem | Stem cell identification | Binary |
| Stem | Stem cell subtype classification | Multi-class |
| Neural | Neural cell identification | Binary |
| Neural | Neural cell subtype classification | Multi-class |

Models take **high-dimensional cell embeddings** as input and perform prediction using **lightweight downstream classifiers**, isolating representation quality from classifier complexity.

---

## Benchmark Models

The following single-cell foundation models are evaluated:

- **Geneformer**
- **scFoundation**
- **scGPT**
- **UCE**
- **scLONG**

---

## Evaluation Metrics

- **Binary classification tasks**
  - Accuracy
  - Precision
  - Recall
  - F1-score
- **Multi-class classification tasks**
  - Accuracy
  - Macro-Precision
  - Macro-Recall
  - Macro-F1

---

## Data Format and Access

### Data Files

All datasets are provided in **AnnData (`.h5ad`) format**.

> **Note**
> `.h5ad` files are not natively supported by the Hugging Face Dataset Viewer.
> Users are expected to download the files and load them locally using standard single-cell analysis tools such as **Scanpy** or **Seurat**.

## Benchmark Results

In addition to raw datasets, we provide **complete benchmark evaluation results** under the `results/` directory.

### Design Rationale

- **`by_model/`**
  Provides a **model-centric view**, facilitating analysis of how a single model performs across different tasks.
- **`by_task/`**
  Provides a **task-centric view**, enabling direct comparison of multiple models on the same task.

Both views contain **identical information** and are provided to improve usability, clarity, and reproducibility.

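
The multi-class metrics listed under Evaluation Metrics can be computed as in the sketch below. The dataset card does not say how the macro averages are defined, so this example assumes the common convention: each score is computed per class and then averaged with equal weight, and a class with no predictions (or no true cells) contributes 0. The function name and the toy labels are hypothetical and are not taken from the benchmark's `results/` files.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def macro_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall and F1.

    Assumption: macro scores are unweighted means of per-class scores over
    the union of true and predicted labels; undefined ratios count as 0.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = set(y_true) | set(y_pred)
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    n_classes = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n_classes,
        "macro_recall": sum(recalls) / n_classes,
        "macro_f1": sum(f1_scores) / n_classes,
    }


# Toy primary-site tracing example with three hypothetical classes.
true_sites = ["lung", "lung", "breast", "breast", "colon", "colon"]
pred_sites = ["lung", "breast", "breast", "breast", "colon", "lung"]
print(macro_metrics(true_sites, pred_sites))
```

Macro averaging weights every class equally regardless of its cell count, which is why it is a natural fit for tasks that include rare cell types.
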

---

## Intended Use

This benchmark is intended for:

- Evaluating **generalization and robustness** of single-cell foundation models
- Studying **tumor cell identification and origin tracing** under unseen conditions
- Benchmarking representation quality across diverse biological contexts

The dataset is **not intended for clinical decision-making**.

---

## Citation

If you use this dataset or benchmark in your work, please cite:

## Contact

For questions, issues, or suggestions, please open an issue on the Hugging Face repository.

Provider:

SiatBioInf
