valsv/scrna-coregulation-benchmark

Source: Hugging Face (updated 2026-04-02, indexed 2026-04-12)
Download link:
https://hf-mirror.com/datasets/valsv/scrna-coregulation-benchmark
Description:

---
license: cc-by-4.0
task_categories:
- tabular-regression
tags:
- single-cell
- scRNA-seq
- gene-expression
- normalization
- benchmark
- coregulation
- coexpression
- anndata
size_categories:
- 100K<n<1M
pretty_name: scRNA-seq Coregulation Benchmark
---

# scRNA-seq Coregulation Benchmark

A benchmark for evaluating whether single-cell RNA-seq normalization methods preserve known gene-gene correlation structure. It provides two complementary ground-truth catalogs:

1. **Promoter-reporter catalog**: Datasets where a fluorescent reporter (GFP/DsRed) is driven by a known gene's promoter. The reporter and its target gene should be *positively correlated*.
2. **Allelic exclusion catalog**: PBMC and B cell datasets where immunoglobulin light chain allelic exclusion (IGKC vs IGLC) provides an expected *negative correlation*.

Together, these test both directions of the correlation spectrum: a good normalization method should recover positive coregulation where it exists and preserve anti-correlation where biology demands it.

## Quick start

```python
from huggingface_hub import hf_hub_download
import anndata as ad

# Promoter-reporter example
path = hf_hub_download(
    repo_id="valsv/scrna-coregulation-benchmark",
    filename="promoter_reporter/GSE316394_BACHD_1.h5ad",
    repo_type="dataset",
)
adata = ad.read_h5ad(path)
reporter = adata.uns["reporters"]["eGFP"]
reporter["target_gene_symbol"]  # "Dlx1", the gene whose promoter drives eGFP

# Allelic exclusion example
path = hf_hub_download(
    repo_id="valsv/scrna-coregulation-benchmark",
    filename="allelic_exclusion/GSE306378_N_rep1.h5ad",
    repo_type="dataset",
)
adata = ad.read_h5ad(path)
pair = adata.uns["exclusion_pairs"]["IGKC_vs_IGLC2"]
pair["gene_a_symbol"]  # "IGKC"
pair["gene_b_symbol"]  # "IGLC2"
```

## Repository structure

```
promoter_reporter/
  GSE160772.h5ad
  GSE181864.h5ad
  GSE198556_*.h5ad   (4 files)
  GSE229976_*.h5ad   (2 files)
  GSE295703_*.h5ad   (3 files)
  GSE296504.h5ad
  GSE316394_*.h5ad   (4 files)
  GSE319345_*.h5ad   (4 files)
allelic_exclusion/
  GSE260943_*.h5ad   (3 files)
  GSE285843_*.h5ad   (6 files)
  GSE306378_*.h5ad   (6 files)
```

## File format

Each `.h5ad` file is one sample (one 10x or Parse capture). Files are named `{series_id}_{sample_suffix}.h5ad`; series with a single sample omit the suffix (for example, `GSE160772.h5ad`).

### `X`: Count matrix

Sparse CSR, dtype `int32`. Raw UMI counts (not normalized). Rows are cells, columns are genes.

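Before applying a normalization method, it can help to confirm that a loaded matrix really holds raw counts. The following is a minimal sketch, not part of the benchmark itself: `check_raw_counts` is a hypothetical helper, and the toy matrix stands in for `adata.X`. It assumes only NumPy and SciPy.

```python
import numpy as np
from scipy import sparse


def check_raw_counts(X):
    """Return basic sanity checks for a raw UMI count matrix.

    Hypothetical helper for illustration; not part of the benchmark repository.
    """
    X = sparse.csr_matrix(X)  # converts to CSR if needed
    data = X.data
    return {
        "is_integer": bool(np.all(data == np.round(data))),
        "is_nonnegative": bool(np.all(data >= 0)),
        "n_cells": X.shape[0],
        "n_genes": X.shape[1],
        # Per-cell UMI totals (row sums)
        "total_counts": np.asarray(X.sum(axis=1)).ravel(),
        # Number of genes with at least one count in each cell
        "n_genes_per_cell": np.asarray((X > 0).sum(axis=1)).ravel(),
    }


# Synthetic 3-cell x 4-gene matrix standing in for adata.X
toy = sparse.csr_matrix(
    np.array([[0, 2, 0, 1], [5, 0, 0, 0], [0, 0, 3, 3]], dtype=np.int32)
)
report = check_raw_counts(toy)
print(report["total_counts"], report["n_genes_per_cell"])
```

For a loaded sample, `check_raw_counts(adata.X)` can be compared against the `total_counts` and `n_genes` columns of `obs`, on the assumption that those columns were derived from the same matrix.
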
### `var`: Gene annotations

| Field | Description |
|-------|-------------|
| `var_names` (index) | Gene symbols (mouse), TAIR locus IDs (Arabidopsis), or gene symbols (human) |
| `gene_id` | Ensembl or TAIR ID |

### `obs`: Cell metadata

| Field | Type | Description |
|-------|------|-------------|
| `total_counts` | int | Total UMI per cell |
| `n_genes` | int | Number of genes with at least one count |

### `uns`: Sample metadata

| Field | Description |
|-------|-------------|
| `sample_id` | GEO sample accession |
| `series_id` | GEO series accession |
| `species` | Species |
| `tissue` | Tissue or cell population |
| `platform` | Sequencing platform/chemistry |

Promoter-reporter files additionally have `uns["reporters"]` and allelic exclusion files have `uns["exclusion_pairs"]` (see below).

## Promoter-reporter catalog

20 harmonized scRNA-seq h5ad files where a fluorescent reporter gene is driven by a known gene's promoter, providing ground-truth positive coregulation at single-cell resolution.

### Reporter metadata: `uns["reporters"]`

Dict keyed by the reporter's name in `var_names`:

```python
{"eGFP": {"target_gene_symbol": "Pdgfrb",
          "target_gene_id": "ENSMUSG00000024620.13",
          "construct": "Pdgfrb-BAC-eGFP"}}
```

### Evaluation

For each sample and reporter, compute:

1. **Target correlation**: Pearson r between the reporter and its target gene (expected positive)
2. **Background correlations**: Pearson r between the reporter and N random non-reporter genes

The target correlation should be substantially higher than the median background correlation.

```python
import numpy as np
from scipy.stats import pearsonr

# `adata` is a promoter-reporter sample loaded as in the Quick start.
reporter_name = "eGFP"
target_name = adata.uns["reporters"][reporter_name]["target_gene_symbol"]

# Per-cell library size, used for counts-per-10k normalization.
totals = np.asarray(adata.X.sum(axis=1)).ravel()


def lognorm(gene):
    """log10(CP10k + 1) expression of one gene across all cells."""
    counts = adata[:, gene].X.toarray().ravel()
    return np.log10(1e4 * counts / totals + 1)


reporter_norm = lognorm(reporter_name)
target_r = pearsonr(reporter_norm, lognorm(target_name))[0]

# Background: 500 random genes, excluding every reporter and the target.
excluded = set(adata.uns["reporters"]) | {target_name}
rng = np.random.default_rng(42)
bg_genes = rng.choice(
    [g for g in adata.var_names if g not in excluded],
    size=500, replace=False,
)
bg_cors = [pearsonr(reporter_norm, lognorm(g))[0] for g in bg_genes]

# Genes with no counts in this sample have zero variance, so pearsonr
# returns NaN for them; nanmedian skips those values.
print(f"Target r: {target_r:.3f}, Background median: {np.nanmedian(bg_cors):.3f}")
```

### Datasets

#### Mouse (17 files)

| Series | Files | Reporter | Target gene | Tissue | Construct | Platform | Cells |
|--------|-------|----------|-------------|--------|-----------|----------|-------|
| GSE160772 | 1 | eGFP | Pdgfrb | Endometrium mesenchyme | BAC transgene | 10x v2 | 6,514 |
| GSE198556 | 4 | eGFP | Pdgfrb | Endometrium (injury time-course) | BAC transgene | 10x v3 | 49,723 |
| GSE181864 | 1 | eGFP | Rorc | Large intestine LP | Knockin | 10x v3 | 9,107 |
| GSE229976 | 2 | eGFP | Il23r | Small intestine | Knockin | 10x v3 | 27,314 |
| GSE296504 | 1 | eGFP + DsRed | Cx3cr1, Cspg4 | P15 eardrum | Knockin + transgene | 10x v3.1 | 4,548 |
| GSE316394 | 4 | eGFP | Dlx1 | E12.5 MGE | BAC transgene | 10x v3.1 | 42,755 |
| GSE319345 | 4 | eGFP | Sox9 | Liver (BDL model) | BAC transgene | Parse WT v1 | 19,819 |

#### Arabidopsis (3 files)

| Series | Files | Reporter | Target gene | Tissue | Construct | Platform | Cells |
|--------|-------|----------|-------------|--------|-----------|----------|-------|
| GSE295703 | 3 | GFP | WER, CORTEX, SCR | Root | Promoter fusion | 10x v3 | 32,078 |

### Notes

- All 16 standard mouse files share the same 78,335 genes in the same order.
  GSE296504 has one additional gene (DsRed, 78,336 total). The three Arabidopsis files have 32,834 genes each.
- Mouse gene references are from Ensembl GRCm39, augmented with eGFP (and DsRed for GSE296504).
- Construct types: knockin (reporter inserted at the endogenous locus), BAC transgene (reporter in a bacterial artificial chromosome), promoter fusion (reporter driven by a cloned proximal promoter).

## Allelic exclusion catalog

15 human scRNA-seq h5ad files for benchmarking against immunoglobulin light chain allelic exclusion. Each B cell commits to either kappa (IGKC) or lambda (IGLC2/IGLC3) light chain expression, never both, which provides an expected anti-correlation signal.

### Exclusion pair metadata: `uns["exclusion_pairs"]`

```python
{"IGKC_vs_IGLC2": {"gene_a_symbol": "IGKC",
                   "gene_a_id": "ENSG00000211592",
                   "gene_b_symbol": "IGLC2",
                   "gene_b_id": "ENSG00000211677",
                   "mechanism": "Immunoglobulin light chain allelic exclusion"}}
```

### Evaluation

In mixed populations (PBMC), most cells express neither light chain. Filter to B cells first to avoid Simpson's paradox:

```python
# Raw counts per cell for each light chain constant-region gene
igkc = adata[:, "IGKC"].X.toarray().ravel()
iglc2 = adata[:, "IGLC2"].X.toarray().ravel()
iglc3 = adata[:, "IGLC3"].X.toarray().ravel()

# Keep cells with at least one light chain count
b_cell_mask = (igkc > 0) | (iglc2 > 0) | (iglc3 > 0)
adata_b = adata[b_cell_mask]
```

Then compute target correlation (expected negative) vs. background, excluding all immunoglobulin genes (IGK\*, IGL\*, IGH\*) from the background pool.

### Datasets

| Series | Files | Condition | Tissue | Platform | Cells |
|--------|-------|-----------|--------|----------|-------|
| GSE306378 | 6 | 3 healthy + 3 SLE | PBMC | 10x | 78,851 |
| GSE285843 | 6 | healthy (3 donors x 2 platforms) | PBMC | 10x + Parse | 72,080 |
| GSE260943 | 3 | healthy (3 donors) | Tonsil B cells | 10x | 47,978 |

### Notes

- All 15 files share the same 33,694 genes (GRCh38, Cell Ranger reference).
- GSE260943 samples are sorted tonsil B cells, so B cell filtering is optional.
- GSE306378 SLE samples have elevated B cell / plasma cell fractions.

## Citations

If you use this benchmark, please cite the original studies that generated the data.

### Promoter-reporter catalog

**GSE160772**: Kirkwood PM, Gibson DA, Smith JR, Wilson-Kanamori JR, Kelepouri O, Esnal-Zufiaurre A, Dobie R, Henderson NC, Saunders PTK. Single-cell RNA sequencing redefines the mesenchymal cell landscape of mouse endometrium. *FASEB J.* 2021;35:e21285. [doi:10.1096/fj.202002123R](https://doi.org/10.1096/fj.202002123R)

**GSE198556**: Kirkwood PM, Gibson DA, Shaw I, Dobie R, Kelepouri O, Henderson NC, Saunders PTK. Single-cell RNA sequencing and lineage tracing confirm mesenchyme to epithelial transformation (MET) contributes to repair of the endometrium at menstruation. *eLife.* 2022;11:e77663. [doi:10.7554/eLife.77663](https://doi.org/10.7554/eLife.77663)

**GSE181864**: Zhou W, Zhou L, Zhou J, Chu C, Zhang C, Sockolow RE, Eberl G, Sonnenberg GF. ZBTB46 defines and regulates ILC3s that protect the intestine. *Nature.* 2022;609(7925):159–165. [doi:10.1038/s41586-022-04934-4](https://doi.org/10.1038/s41586-022-04934-4)

**GSE229976**: Ahmed A, Joseph AM, Zhou J, Horn V, Uddin J, Lyu M, Goc J, et al. CTLA-4-expressing ILC3s restrain interleukin-23-mediated inflammation. *Nature.* 2024;630:976–983. [doi:10.1038/s41586-024-07537-3](https://doi.org/10.1038/s41586-024-07537-3)

**GSE295703**: Chau TN, Ryu KH, Alajoleen R, Bargmann BO, Schiefelbein J, Li S. scCoBench: Benchmarking single cell RNA-seq co-expression using promoter-reporter lines. *bioRxiv.* 2025. [doi:10.1101/2025.05.26.656221](https://doi.org/10.1101/2025.05.26.656221)

**GSE296504**: Shi X, et al. (2026). Preprint: [bioRxiv 10.64898/2026.01.13.699360](https://www.biorxiv.org/content/10.64898/2026.01.13.699360v1)

**GSE316394**: Molero AE, Devakanmalai GS, Altun YM, Jover-Mengual T, Zhang J, Khan N, Mehler MF.
Aberrant medial ganglionic eminence (MGE) GABAergic neurogenesis contributes to Huntington's disease pathogenesis. *Neurobiol Dis.* 2026;221:107297. [doi:10.1016/j.nbd.2026.107297](https://doi.org/10.1016/j.nbd.2026.107297)

**GSE319345**: Kanakanui KG, Hantelys F, Hrncir HR, Bombin S, Gracz AD. Multi-gene biomarkers reveal spatial organization and subpopulation-specific damage response in intrahepatic biliary epithelial cells. *bioRxiv.* 2026. [doi:10.64898/2026.02.12.705355](https://doi.org/10.64898/2026.02.12.705355)

### Allelic exclusion catalog

**GSE260943**: McGrath JJC, Park J, Troxell CA, Chervin JC, Li L, Kent JR, Changrob S, et al. Mutability and hypermutation antagonize immunoglobulin codon optimality. *Mol Cell.* 2025;85(2):430–444.e6. [doi:10.1016/j.molcel.2024.11.033](https://doi.org/10.1016/j.molcel.2024.11.033)

**GSE306378**: Cheng LL, Tang ZF, Li M, Chen JJ, Shang SS, Huang CB. Single-cell sequencing-based analysis of CD4+ T-cell and B-cell heterogeneity in patients with lupus nephritis. *BMC Med Genomics.* 2026;19(1):29. [doi:10.1186/s12920-025-02277-3](https://doi.org/10.1186/s12920-025-02277-3)

**GSE285843**: Publication pending (no citation listed on GEO as of March 2026).

Provided by:
valsv