valsv/scrna-coregulation-benchmark
Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/valsv/scrna-coregulation-benchmark
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-regression
tags:
- single-cell
- scRNA-seq
- gene-expression
- normalization
- benchmark
- coregulation
- coexpression
- anndata
size_categories:
- 100K<n<1M
pretty_name: scRNA-seq Coregulation Benchmark
---
# scRNA-seq Coregulation Benchmark
A benchmark for evaluating whether single-cell RNA-seq normalization methods preserve known gene-gene correlation structure. It provides two complementary ground-truth catalogs:
1. **Promoter-reporter catalog** — Datasets where a fluorescent reporter (GFP/DsRed) is driven by a known gene's promoter. The reporter and its target gene should be *positively correlated*.
2. **Allelic exclusion catalog** — PBMC and B cell datasets where immunoglobulin light chain allelic exclusion (IGKC vs IGLC) provides an expected *negative correlation*.

Together, these test both directions of the correlation spectrum: a good normalization method should recover positive coregulation where it exists and preserve anti-correlation where biology demands it.
## Quick start
```python
from huggingface_hub import hf_hub_download
import anndata as ad

# Promoter-reporter example
path = hf_hub_download(
    repo_id="valsv/scrna-coregulation-benchmark",
    filename="promoter_reporter/GSE316394_BACHD_1.h5ad",
    repo_type="dataset",
)
adata = ad.read_h5ad(path)
reporter = adata.uns["reporters"]["eGFP"]
reporter["target_gene_symbol"]  # "Dlx1": the gene whose promoter drives eGFP

# Allelic exclusion example
path = hf_hub_download(
    repo_id="valsv/scrna-coregulation-benchmark",
    filename="allelic_exclusion/GSE306378_N_rep1.h5ad",
    repo_type="dataset",
)
adata = ad.read_h5ad(path)
pair = adata.uns["exclusion_pairs"]["IGKC_vs_IGLC2"]
pair["gene_a_symbol"]  # "IGKC"
pair["gene_b_symbol"]  # "IGLC2"
```
## Repository structure
```
promoter_reporter/
  GSE160772.h5ad
  GSE181864.h5ad
  GSE198556_*.h5ad (4 files)
  GSE229976_*.h5ad (2 files)
  GSE295703_*.h5ad (3 files)
  GSE296504.h5ad
  GSE316394_*.h5ad (4 files)
  GSE319345_*.h5ad (4 files)
allelic_exclusion/
  GSE260943_*.h5ad (3 files)
  GSE285843_*.h5ad (6 files)
  GSE306378_*.h5ad (6 files)
```
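The layout above can also be enumerated programmatically. The following is a minimal sketch, assuming `list_repo_files` from `huggingface_hub` is available in your installed version; the helper names `group_by_catalog` and `list_benchmark_files` are illustrative and not part of the dataset.

```python
from collections import defaultdict


def group_by_catalog(paths):
    """Group repository paths ending in .h5ad by their top-level directory."""
    groups = defaultdict(list)
    for p in paths:
        if p.endswith(".h5ad") and "/" in p:
            catalog, name = p.split("/", 1)
            groups[catalog].append(name)
    return {k: sorted(v) for k, v in groups.items()}


def list_benchmark_files(repo_id="valsv/scrna-coregulation-benchmark"):
    """Fetch the file listing from the Hub (needs network access)."""
    from huggingface_hub import list_repo_files  # imported lazily

    return group_by_catalog(list_repo_files(repo_id, repo_type="dataset"))
```

If the repository matches the tree above, `list_benchmark_files()` should return two keys, `promoter_reporter` and `allelic_exclusion`, each mapping to a sorted list of file names.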
## File format
Each `.h5ad` file is one sample (one 10x or Parse capture). Files from multi-sample series are named `{series_id}_{sample_suffix}.h5ad`; single-sample series use `{series_id}.h5ad`.
### `X` — Count matrix
Sparse CSR, dtype `int32`. Raw UMI counts (not normalized). Rows are cells, columns are genes.
### `var` — Gene annotations
| Field | Description |
|-------|-------------|
| `var_names` (index) | Gene symbols (mouse), TAIR locus IDs (Arabidopsis), or gene symbols (human) |
| `gene_id` | Ensembl or TAIR ID |
### `obs` — Cell metadata
| Field | Type | Description |
|-------|------|-------------|
| `total_counts` | int | Total UMI per cell |
| `n_genes` | int | Number of genes with at least one count |
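
Both `obs` fields can be recomputed from `X` as a sanity check. This is a small sketch, assuming `X` is a cells x genes count matrix as described above; `recompute_qc` is an illustrative helper name.

```python
import numpy as np
import scipy.sparse as sp


def recompute_qc(X):
    """Return (total_counts, n_genes) per cell from a cells x genes count matrix."""
    X = sp.csr_matrix(X)
    total_counts = np.asarray(X.sum(axis=1)).ravel()
    # Count entries > 0 per row; explicitly stored zeros are ignored
    n_genes = np.asarray((X > 0).sum(axis=1)).ravel()
    return total_counts, n_genes
```

For a loaded file, `recompute_qc(adata.X)` would be expected to match `adata.obs["total_counts"]` and `adata.obs["n_genes"]` if those fields were derived from the stored matrix.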
### `uns` — Sample metadata
| Field | Description |
|-------|-------------|
| `sample_id` | GEO sample accession |
| `series_id` | GEO series accession |
| `species` | Species |
| `tissue` | Tissue or cell population |
| `platform` | Sequencing platform/chemistry |

Promoter-reporter files additionally have `uns["reporters"]` and allelic exclusion files have `uns["exclusion_pairs"]` (see below).
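
The layout described in this section can be checked in one pass after loading a file. The sketch below is an assumption about how such a check might look, not a validator shipped with the dataset; `layout_problems` and the two constant tuples are illustrative names.

```python
import scipy.sparse as sp

REQUIRED_UNS = ("sample_id", "series_id", "species", "tissue", "platform")
REQUIRED_OBS = ("total_counts", "n_genes")


def layout_problems(adata):
    """Return a list of human-readable deviations from the documented layout."""
    problems = []
    X = adata.X
    if not (sp.issparse(X) and X.format == "csr"):
        problems.append("X is not sparse CSR")
    elif X.dtype.name != "int32":
        problems.append(f"X dtype is {X.dtype}, expected int32")
    problems += [f"missing obs field: {k}" for k in REQUIRED_OBS if k not in adata.obs]
    if "gene_id" not in adata.var:
        problems.append("missing var field: gene_id")
    problems += [f"missing uns field: {k}" for k in REQUIRED_UNS if k not in adata.uns]
    if "reporters" not in adata.uns and "exclusion_pairs" not in adata.uns:
        problems.append("neither uns['reporters'] nor uns['exclusion_pairs'] present")
    return problems
```

An empty list means the file matches the fields listed in the tables above; anything else names the field that differs.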
## Promoter-reporter catalog
20 harmonized scRNA-seq h5ad files where a fluorescent reporter gene is driven by a known gene's promoter, providing ground-truth positive coregulation at single-cell resolution.
### Reporter metadata — `uns["reporters"]`
Dict keyed by the reporter's name in `var_names`:
```python
{"eGFP": {"target_gene_symbol": "Pdgfrb",
          "target_gene_id": "ENSMUSG00000024620.13",
          "construct": "Pdgfrb-BAC-eGFP"}}
```
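Files with more than one reporter (GSE296504 lists eGFP and DsRed) carry one entry per reporter, so it can help to iterate over the dict rather than hard-code a key. The sketch below uses an illustrative helper, `reporter_target_pairs`. The optional `var_names` check is an added assumption: the card does not state whether `target_gene_symbol` matches `var_names` in the Arabidopsis files, which are indexed by TAIR locus IDs.

```python
def reporter_target_pairs(uns, var_names=None):
    """Return [(reporter_name, target_gene_symbol), ...] for one sample.

    If var_names is given, raise KeyError when a reporter or target is absent.
    """
    pairs = [
        (name, meta["target_gene_symbol"])
        for name, meta in uns.get("reporters", {}).items()
    ]
    if var_names is not None:
        known = set(var_names)
        missing = [g for pair in pairs for g in pair if g not in known]
        if missing:
            raise KeyError(f"not found in var_names: {missing}")
    return pairs
```

Typical use would be `for reporter, target in reporter_target_pairs(adata.uns, adata.var_names): ...`.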
### Evaluation
For each sample and reporter, compute:
1. **Target correlation**: Pearson r between the reporter and its target gene (expected positive)
2. **Background correlations**: Pearson r between the reporter and N random non-reporter genes

The target correlation should be substantially higher than the median background correlation. The example below uses a log10(CP10K + 1) normalization and 500 background genes:
```python
import numpy as np
from scipy.stats import pearsonr

reporter_name = "eGFP"
target_name = adata.uns["reporters"][reporter_name]["target_gene_symbol"]

# Per-cell library size, used for a simple log10(CP10K + 1) normalization
totals = np.asarray(adata.X.sum(axis=1)).ravel()

def lognorm(gene):
    """Return log10(CP10K + 1) expression of one gene across all cells."""
    counts = adata[:, gene].X.toarray().ravel()
    return np.log10(1e4 * counts / totals + 1)

reporter_norm = lognorm(reporter_name)
target_r = pearsonr(reporter_norm, lognorm(target_name))[0]

# Background pool: exclude every reporter in this sample and the target gene
excluded = set(adata.uns["reporters"]) | {target_name}
rng = np.random.default_rng(42)
bg_genes = rng.choice(
    [g for g in adata.var_names if g not in excluded],
    size=500, replace=False,
)
bg_cors = [pearsonr(reporter_norm, lognorm(g))[0] for g in bg_genes]

# nanmedian: pearsonr returns NaN for genes with no counts in this sample
print(f"Target r: {target_r:.3f}, Background median: {np.nanmedian(bg_cors):.3f}")
```
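"Substantially higher" can be made concrete by reporting where the target correlation falls within the background distribution. The summary below is one possible choice and an assumption on our part, not a metric defined by the benchmark; `background_summary` is an illustrative name.

```python
import numpy as np


def background_summary(target_r, bg_cors):
    """Summarize how a target correlation compares with background correlations."""
    bg = np.asarray(bg_cors, dtype=float)
    bg = bg[~np.isnan(bg)]  # drop undefined correlations (e.g. all-zero genes)
    median = float(np.median(bg))
    return {
        "n_background": int(bg.size),
        "background_median": median,
        "margin": float(target_r - median),
        # Fraction of background genes correlating at least as strongly as the target
        "frac_bg_ge_target": float(np.mean(bg >= target_r)),
    }
```

For a reporter that tracks its target well, `frac_bg_ge_target` would be expected to sit close to zero. For the allelic exclusion catalog the comparison direction would need to be flipped (`bg <= target_r`).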
### Datasets
#### Mouse (17 files)
| Series | Files | Reporter | Target gene | Tissue | Construct | Platform | Cells |
|--------|-------|----------|-------------|--------|-----------|----------|-------|
| GSE160772 | 1 | eGFP | Pdgfrb | Endometrium mesenchyme | BAC transgene | 10x v2 | 6,514 |
| GSE198556 | 4 | eGFP | Pdgfrb | Endometrium (injury time-course) | BAC transgene | 10x v3 | 49,723 |
| GSE181864 | 1 | eGFP | Rorc | Large intestine LP | Knockin | 10x v3 | 9,107 |
| GSE229976 | 2 | eGFP | Il23r | Small intestine | Knockin | 10x v3 | 27,314 |
| GSE296504 | 1 | eGFP + DsRed | Cx3cr1, Cspg4 | P15 eardrum | Knockin + transgene | 10x v3.1 | 4,548 |
| GSE316394 | 4 | eGFP | Dlx1 | E12.5 MGE | BAC transgene | 10x v3.1 | 42,755 |
| GSE319345 | 4 | eGFP | Sox9 | Liver (BDL model) | BAC transgene | Parse WT v1 | 19,819 |
#### Arabidopsis (3 files)
| Series | Files | Reporter | Target gene | Tissue | Construct | Platform | Cells |
|--------|-------|----------|-------------|--------|-----------|----------|-------|
| GSE295703 | 3 | GFP | WER, CORTEX, SCR | Root | Promoter fusion | 10x v3 | 32,078 |
### Notes
- All 16 standard mouse files share the same 78,335 genes in the same order. GSE296504 has one additional gene (DsRed, 78,336 total). The three Arabidopsis files have 32,834 genes each.
- Mouse gene references are from Ensembl GRCm39, augmented with eGFP (and DsRed for GSE296504).
- Construct types: knockin (reporter inserted at the endogenous locus), BAC transgene (reporter in a bacterial artificial chromosome), promoter fusion (reporter driven by a cloned proximal promoter).
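
Because files within a group share the same genes in the same order, their count matrices can be stacked row-wise without reindexing. The sketch below checks that assumption before stacking; `stack_samples` is an illustrative helper, and `anndata.concat` is the more general route when cell metadata should be carried along.

```python
import scipy.sparse as sp


def stack_samples(matrices, var_names_list):
    """Vertically stack cells x genes matrices after checking gene order matches."""
    reference = list(var_names_list[0])
    for i, names in enumerate(var_names_list[1:], start=1):
        if list(names) != reference:
            raise ValueError(f"sample {i} has a different gene set or order")
    return sp.vstack(matrices, format="csr")
```

Mixing GSE296504 with the other mouse files would raise here, since that file has one additional gene (DsRed).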
## Allelic exclusion catalog
15 human scRNA-seq h5ad files for benchmarking using immunoglobulin light chain allelic exclusion. Each B cell commits to either kappa (IGKC) or lambda (IGLC2/IGLC3) light chain expression — never both — providing an expected anti-correlation signal.
### Exclusion pair metadata — `uns["exclusion_pairs"]`
```python
{"IGKC_vs_IGLC2": {"gene_a_symbol": "IGKC",
                   "gene_a_id": "ENSG00000211592",
                   "gene_b_symbol": "IGLC2",
                   "gene_b_id": "ENSG00000211677",
                   "mechanism": "Immunoglobulin light chain allelic exclusion"}}
```
### Evaluation
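In mixed populations (PBMC), most cells express neither light chain. Pooling these double-negative cells with B cells can weaken or even reverse the anti-correlation seen within B cells (Simpson's paradox), so filter to B cells first. The snippet below keeps cells with at least one light chain count as a simple proxy: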
```python
igkc = adata[:, "IGKC"].X.toarray().ravel()
iglc2 = adata[:, "IGLC2"].X.toarray().ravel()
iglc3 = adata[:, "IGLC3"].X.toarray().ravel()
b_cell_mask = (igkc > 0) | (iglc2 > 0) | (iglc3 > 0)
adata_b = adata[b_cell_mask]
```
Then compute target correlation (expected negative) vs. background, excluding all immunoglobulin genes (IGK\*, IGL\*, IGH\*) from the background pool.
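
The step described above might be sketched as follows. This is a hedged illustration, not the authors' evaluation code: `exclusion_eval` and `IG_PREFIXES` are illustrative names, the normalization reuses log10(CP10K + 1) from the promoter-reporter example, and restricting the background pool to genes with at least one count is an added assumption that avoids undefined correlations.

```python
import numpy as np
import scipy.sparse as sp

IG_PREFIXES = ("IGK", "IGL", "IGH")


def exclusion_eval(X, var_names, gene_a="IGKC", gene_b="IGLC2",
                   n_background=500, seed=42):
    """Return (target_r, bg_cors) for one exclusion pair.

    X: cells x genes raw UMI counts (sparse or dense), already filtered to B cells.
    """
    X = sp.csc_matrix(X)  # column-oriented for cheap per-gene access
    names = [str(g) for g in var_names]
    totals = np.asarray(X.sum(axis=1)).ravel()

    def lognorm(j):
        # log10(CP10K + 1) for gene column j
        col = X[:, j].toarray().ravel()
        return np.log10(1e4 * col / totals + 1)

    a = lognorm(names.index(gene_a))
    b = lognorm(names.index(gene_b))
    target_r = np.corrcoef(a, b)[0, 1]

    # Background pool: genes outside IGK*/IGL*/IGH* with at least one count.
    # The expression filter is an added assumption (all-zero genes give NaN).
    n_counts = np.asarray(X.sum(axis=0)).ravel()
    pool = [j for j, g in enumerate(names)
            if not g.startswith(IG_PREFIXES) and n_counts[j] > 0]
    rng = np.random.default_rng(seed)
    picked = rng.choice(pool, size=min(n_background, len(pool)), replace=False)
    bg_cors = np.array([np.corrcoef(a, lognorm(j))[0, 1] for j in picked])
    return target_r, bg_cors
```

With the filtered object from the snippet above, a call would look like `target_r, bg_cors = exclusion_eval(adata_b.X, adata_b.var_names)`. The prefix filter is deliberately broad, so it also drops any non-immunoglobulin gene whose symbol happens to start with one of the three prefixes.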
### Datasets
| Series | Files | Condition | Tissue | Platform | Cells |
|--------|-------|-----------|--------|----------|-------|
| GSE306378 | 6 | 3 healthy + 3 SLE | PBMC | 10x | 78,851 |
| GSE285843 | 6 | healthy (3 donors x 2 platforms) | PBMC | 10x + Parse | 72,080 |
| GSE260943 | 3 | healthy (3 donors) | Tonsil B cells | 10x | 47,978 |
### Notes
- All 15 files share the same 33,694 genes (GRCh38, Cell Ranger reference).
- GSE260943 samples are sorted tonsil B cells — B cell filtering is optional.
- GSE306378 SLE samples have elevated B cell / plasma cell fractions.
## Citations
If you use this benchmark, please cite the original studies that generated the data.
### Promoter-reporter catalog
**GSE160772** — Kirkwood PM, Gibson DA, Smith JR, Wilson-Kanamori JR, Kelepouri O, Esnal-Zufiaurre A, Dobie R, Henderson NC, Saunders PTK. Single-cell RNA sequencing redefines the mesenchymal cell landscape of mouse endometrium. *FASEB J.* 2021;35:e21285. [doi:10.1096/fj.202002123R](https://doi.org/10.1096/fj.202002123R)
**GSE198556** — Kirkwood PM, Gibson DA, Shaw I, Dobie R, Kelepouri O, Henderson NC, Saunders PTK. Single-cell RNA sequencing and lineage tracing confirm mesenchyme to epithelial transformation (MET) contributes to repair of the endometrium at menstruation. *eLife.* 2022;11:e77663. [doi:10.7554/eLife.77663](https://doi.org/10.7554/eLife.77663)
**GSE181864** — Zhou W, Zhou L, Zhou J, Chu C, Zhang C, Sockolow RE, Eberl G, Sonnenberg GF. ZBTB46 defines and regulates ILC3s that protect the intestine. *Nature.* 2022;609(7925):159–165. [doi:10.1038/s41586-022-04934-4](https://doi.org/10.1038/s41586-022-04934-4)
**GSE229976** — Ahmed A, Joseph AM, Zhou J, Horn V, Uddin J, Lyu M, Goc J, et al. CTLA-4-expressing ILC3s restrain interleukin-23-mediated inflammation. *Nature.* 2024;630:976–983. [doi:10.1038/s41586-024-07537-3](https://doi.org/10.1038/s41586-024-07537-3)
**GSE295703** — Chau TN, Ryu KH, Alajoleen R, Bargmann BO, Schiefelbein J, Li S. scCoBench: Benchmarking single cell RNA-seq co-expression using promoter-reporter lines. *bioRxiv.* 2025. [doi:10.1101/2025.05.26.656221](https://doi.org/10.1101/2025.05.26.656221)
**GSE296504** — Shi X, et al. (2026). Preprint: [bioRxiv 10.64898/2026.01.13.699360](https://www.biorxiv.org/content/10.64898/2026.01.13.699360v1)
**GSE316394** — Molero AE, Devakanmalai GS, Altun YM, Jover-Mengual T, Zhang J, Khan N, Mehler MF. Aberrant medial ganglionic eminence (MGE) GABAergic neurogenesis contributes to Huntington's disease pathogenesis. *Neurobiol Dis.* 2026;221:107297. [doi:10.1016/j.nbd.2026.107297](https://doi.org/10.1016/j.nbd.2026.107297)
**GSE319345** — Kanakanui KG, Hantelys F, Hrncir HR, Bombin S, Gracz AD. Multi-gene biomarkers reveal spatial organization and subpopulation-specific damage response in intrahepatic biliary epithelial cells. *bioRxiv.* 2026. [doi:10.64898/2026.02.12.705355](https://doi.org/10.64898/2026.02.12.705355)
### Allelic exclusion catalog
**GSE260943** — McGrath JJC, Park J, Troxell CA, Chervin JC, Li L, Kent JR, Changrob S, et al. Mutability and hypermutation antagonize immunoglobulin codon optimality. *Mol Cell.* 2025;85(2):430–444.e6. [doi:10.1016/j.molcel.2024.11.033](https://doi.org/10.1016/j.molcel.2024.11.033)
**GSE306378** — Cheng LL, Tang ZF, Li M, Chen JJ, Shang SS, Huang CB. Single-cell sequencing-based analysis of CD4+ T-cell and B-cell heterogeneity in patients with lupus nephritis. *BMC Med Genomics.* 2026;19(1):29. [doi:10.1186/s12920-025-02277-3](https://doi.org/10.1186/s12920-025-02277-3)
**GSE285843** — Publication pending (no citation listed on GEO as of March 2026).
Provider: valsv



