PABannier/HistoAtlas
Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/PABannier/HistoAtlas
Resource description:
---
language:
- en
license: cc-by-nc-4.0
size_categories:
- 1K<n<10K
task_categories:
- tabular-classification
- tabular-regression
tags:
- biology
- medical
- cancer
- pathology
- histology
- computational-pathology
- digital-pathology
- TCGA
- pan-cancer
- morphometrics
- cell-segmentation
- survival-analysis
- biomarkers
- oncology
pretty_name: "HistoAtlas: Pan-Cancer Histomics from TCGA"
configs:
- config_name: default
  data_files:
  - split: data
    path: data.parquet
---
<div align="center">
# HistoAtlas: Pan-Cancer Quantitative Histomics
### 38 interpretable morphometric features from 6,745 TCGA diagnostic H&E slides across 21 solid-tumor cancer types
[Paper (arXiv:2603.16587)](https://arxiv.org/abs/2603.16587)
[Interactive Web Atlas](https://histoatlas.com)
[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
[Code](https://github.com/histoatlas/histoatlas)
<img src="images/pipeline_overview.png" alt="HistoAtlas Pipeline" width="800"/>
*From raw H&E whole-slide images to quantitative histomics: tissue and cell segmentation, compartment-resolved feature extraction, pan-cancer statistical analysis, and an interactive web atlas.*
</div>
---
## Overview
**HistoAtlas** is a pan-cancer computational histopathology atlas that quantifies tumor morphology from routine H&E-stained diagnostic whole-slide images. This dataset contains **38 interpretable, compartment-resolved histomic features** extracted from **6,745 TCGA slides** spanning **21 solid-tumor cancer types**, representing 6,745 unique patients.
Each feature captures a specific, biologically interpretable aspect of tumor architecture: tissue composition, cell densities, nuclear morphology, spatial organization of immune and stromal cells, and intra-tumoral heterogeneity. Features are computed from automated tissue and cell segmentation at single-cell resolution, then aggregated at the slide level.
The full precomputed statistical analysis (survival associations, molecular correlations, mutation associations, morphological clusters) is available through the [interactive web atlas](https://histoatlas.com) and the companion [arXiv paper](https://arxiv.org/abs/2603.16587).
---
## Dataset Description
### Source Data
Formalin-fixed, paraffin-embedded (FFPE) H&E-stained diagnostic whole-slide images were obtained from [The Cancer Genome Atlas (TCGA)](https://portal.gdc.cancer.gov/) via the Genomic Data Commons (GDC) portal. One slide per patient was retained (primary tumor diagnostic slide with the largest tissue area), yielding 6,745 slides across 6,745 unique patients.
**Inclusion criteria:**
- Viable tissue area above 1 mm^2
- No severe processing artifacts (pen marks covering >20% of tissue, out-of-focus regions)
- Essential clinical metadata available (vital status, follow-up time)
**Twelve additional TCGA cancer types were excluded** because their dominant cell morphologies fall outside the training domain of the cell detection model.
### Segmentation Pipeline
Feature extraction used a two-stage segmentation pipeline:
1. **Tissue segmentation**: A CellViT-inspired architecture ([Horst et al., 2024](https://doi.org/10.1016/j.media.2024.103143)) with a [Phikon](https://arxiv.org/abs/2310.07033) self-supervised ViT-B backbone, trained on the [PanopTILs](https://doi.org/10.1016/j.media.2024.103191) crowdsourced annotation dataset. Inference at 0.5 um/px on 224x224 pixel tiles classified each region into five effective tissue compartments: cancerous epithelium, stroma, necrosis, normal epithelium, and blood.
2. **Cell segmentation and classification**: The [HistoPLUS model](https://arxiv.org/abs/2508.09926) detected and classified individual cells into nine morphological types: tumor cells, lymphocytes, fibroblasts, plasmocytes, neutrophils, eosinophils, red blood cells, apoptotic bodies, and mitotic figures. Inference ran at 40x magnification (0.25 um/px), with overlap deduplication via a union-find algorithm.
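To illustrate the union-find deduplication mentioned in step 2, here is a minimal sketch. The card does not say how duplicate detections are identified, so the centroid-distance criterion, the `radius_um` threshold, the `(x_um, y_um, score)` tuple layout, and the "keep the highest score" rule are all assumptions made for illustration, not the pipeline's actual logic.

```python
from itertools import combinations
from math import hypot


def find(parent, i):
    """Return the root of element i, shortening the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def deduplicate(cells, radius_um=3.0):
    """Merge detections whose centroids lie within radius_um of each other.

    cells: list of (x_um, y_um, score) tuples (hypothetical layout).
    Returns one detection per cluster, keeping the highest score.
    """
    parent = list(range(len(cells)))
    # Brute-force pair scan for clarity; a spatial index would scale better.
    for i, j in combinations(range(len(cells)), 2):
        dist = hypot(cells[i][0] - cells[j][0], cells[i][1] - cells[j][1])
        if dist <= radius_um:
            root_i, root_j = find(parent, i), find(parent, j)
            if root_i != root_j:
                parent[root_j] = root_i  # union the two clusters
    best = {}
    for idx, cell in enumerate(cells):
        root = find(parent, idx)
        if root not in best or cell[2] > best[root][2]:
            best[root] = cell
    return list(best.values())
```

Union-find makes the merge transitive: if detection A overlaps B and B overlaps C, all three collapse into one cell even when A and C are farther apart than the threshold.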
### Spatial Feature Computation
All spatial features were computed on compartment masks resampled to 8 um/px. Five spatial bands were defined using the signed Euclidean distance transform from the tumor boundary:
| Band | Definition | Distance |
|------|-----------|----------|
| Tumor front | Outer rim of tumor | 0-50 um inside tumor boundary |
| Tumor core | Deep tumor interior | >50 um inside |
| Peritumoral stroma (near) | Stroma adjacent to tumor | 0-50 um outside |
| Peritumoral stroma (far) | Distant stroma | 50-200 um outside |
| Necrosis ring | Perinecrotic zone | 0-100 um from necrosis |
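The tumor-relative bands in the table can be sketched with a signed Euclidean distance transform. This is an illustrative reconstruction, not the authors' code: it assumes boolean `tumor_mask` and `stroma_mask` arrays at 8 um/px, uses a sign convention of positive inside the tumor, restricts the peritumoral bands to stroma pixels, and omits the necrosis ring (which would need a second transform computed from the necrosis mask).

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

UM_PER_PX = 8.0  # mask resolution stated in the text


def signed_distance_um(tumor_mask):
    """Signed distance to the tumor boundary in um (positive inside tumor)."""
    tumor_mask = np.asarray(tumor_mask, dtype=bool)
    inside = distance_transform_edt(tumor_mask)    # distance to nearest non-tumor pixel
    outside = distance_transform_edt(~tumor_mask)  # distance to nearest tumor pixel
    return (inside - outside) * UM_PER_PX


def tumor_bands(tumor_mask, stroma_mask):
    """Return boolean masks for the four tumor-relative bands."""
    d = signed_distance_um(tumor_mask)
    stroma = np.asarray(stroma_mask, dtype=bool)
    return {
        "tumor_front": (d > 0) & (d <= 50),
        "tumor_core": d > 50,
        "stroma_near": stroma & (d < 0) & (d >= -50),
        "stroma_far": stroma & (d < -50) & (d >= -200),
    }
```

At 8 um/px, the 50 um front band is only about six pixels wide, so band areas are quantized fairly coarsely on small tumor regions.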
<div align="center">
<img src="images/fig6_spatial_explainability.png" alt="Spatial explainability" width="700"/>
*Spatial interpretability: tissue compartment maps, top-scoring tiles with cell-type overlays, and the link between spatial features and survival associations.*
</div>
---
## Cancer Types
| Code | Cancer Type | N slides |
|------|------------|----------|
| BRCA | Breast invasive carcinoma | 1,037 |
| LUAD | Lung adenocarcinoma | 511 |
| THCA | Thyroid carcinoma | 473 |
| HNSC | Head and neck squamous cell carcinoma | 471 |
| UCEC | Uterine corpus endometrial carcinoma | 459 |
| COAD | Colon adenocarcinoma | 441 |
| BLCA | Bladder urothelial carcinoma | 417 |
| STAD | Stomach adenocarcinoma | 400 |
| LIHC | Liver hepatocellular carcinoma | 365 |
| LUSC | Lung squamous cell carcinoma | 357 |
| PRAD | Prostate adenocarcinoma | 353 |
| CESC | Cervical squamous cell carcinoma | 279 |
| ACC | Adrenocortical carcinoma | 227 |
| THYM | Thymoma | 180 |
| ESCA | Esophageal carcinoma | 158 |
| READ | Rectum adenocarcinoma | 157 |
| PAAD | Pancreatic adenocarcinoma | 146 |
| OV | Ovarian serous cystadenocarcinoma | 107 |
| UCS | Uterine carcinosarcoma | 87 |
| MESO | Mesothelioma | 82 |
| CHOL | Cholangiocarcinoma | 38 |
| **Total** | | **6,745** |
---
## Features
Each slide has 51 `*_value` columns (quantitative measurements) and 51 matching `*_status` columns (QC flags: `"ok"`, `"warn"`, or `"fail"`). The 38 core histomic features are organized into five categories:
### Tissue Composition (7 features)
| Feature | Description |
|---------|-------------|
| `tumor_area_fraction` | Fraction of tissue area classified as cancerous epithelium |
| `stroma_area_fraction` | Fraction of tissue area classified as stroma |
| `normal_epithelium_area_fraction` | Fraction classified as normal epithelium |
| `tumor_front_fraction` | Proportion of tumor area within the 0-50 um front band |
| `largest_tumor_component_share` | Share of total tumor area in the largest connected component |
| `tumor_region_solidity` | Convex hull solidity of the tumor mask |
| `tissue_coverage` | Fraction of the slide covered by tissue |
### Cell Densities (6 features)
| Feature | Description |
|---------|-------------|
| `intratumoral_lymphocyte_density` | Lymphocyte count per mm^2 within tumor epithelium |
| `stromal_lymphocyte_density` | Lymphocyte count per mm^2 within stroma |
| `intratumoral_cancer_cell_density` | Cancer cell count per mm^2 within tumor |
| `fibroblast_density_stroma` | Fibroblast count per mm^2 within stroma |
| `intratumoral_eosinophil_density` | Eosinophil count per mm^2 within tumor |
| `intratumoral_neutrophil_density` | Neutrophil count per mm^2 within tumor |
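As a worked example of the "count per mm^2" unit used in this table, the sketch below converts a compartment's pixel count into an area and divides the cell count by it. The function name and the use of the 8 um/px mask resolution for the area are assumptions for illustration; the card does not specify at which resolution compartment areas are measured.

```python
def density_per_mm2(cell_count, compartment_pixels, um_per_px=8.0):
    """Cells per mm^2 for a compartment covering `compartment_pixels` pixels."""
    # One pixel covers (um_per_px / 1000)^2 mm^2.
    area_mm2 = compartment_pixels * (um_per_px / 1000.0) ** 2
    if area_mm2 == 0:
        return float("nan")  # density is undefined for an empty compartment
    return cell_count / area_mm2
```

For instance, 15,625 pixels at 8 um/px cover 1 mm^2, so 500 lymphocytes in that region give a density of about 500 per mm^2.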
### Nuclear Morphology and Kinetics (8 features)
| Feature | Description |
|---------|-------------|
| `tumor_nuclear_area_median` | Median nuclear area of tumor cells (um^2) |
| `tumor_nuclear_eccentricity_median` | Median nuclear eccentricity (0 = circle, 1 = line) |
| `tumor_nuclear_irregularity_median` | Median nuclear contour irregularity |
| `tumor_nuclear_irregularity_iqr` | IQR of nuclear irregularity (morphological heterogeneity) |
| `tumor_pleomorphism_index` | Composite nuclear pleomorphism score |
| `mitotic_index_tumor` | Mitotic figure count per mm^2 of tumor |
| `apoptotic_index_tumor` | Apoptotic body count per mm^2 of tumor |
| `apoptosis_mitosis_ratio_tumor` | Ratio of apoptotic to mitotic events |
### Spatial Organization (18 features)
| Feature | Description |
|---------|-------------|
| `lymphocyte_infiltration_ratio_front` | Lymphocyte density ratio: tumor front vs. core |
| `myeloid_infiltration_ratio_front` | Myeloid cell density ratio: tumor front vs. core |
| `tumor_lymphocyte_nn_distance_front` | Mean nearest-neighbor distance from tumor cells to lymphocytes at the front |
| `tumor_fibroblast_coupling_front` | Fibroblast density ratio at the tumor front |
| `tumor_stroma_interface_density` | Length of tumor-stroma boundary per unit tumor area |
| `interface_normalized_immune_pressure` | Immune cell density normalized by interface length |
| `invasion_depth_p75` | 75th percentile of tumor invasion depth (um) |
| `peritumoral_immune_richness` | Shannon diversity of immune cell types in peritumoral stroma |
| `peritumoral_fibroblast_enrichment` | Fibroblast enrichment in peritumoral vs. distal stroma |
| `immune_desert_fraction` | Fraction of tumor area devoid of immune cells |
| `deep_intratumoral_lymphocyte_fraction` | Fraction of intratumoral lymphocytes in tumor core (>50 um) |
| `fibroblast_lymphocyte_proximity_stroma` | Mean distance between fibroblasts and lymphocytes in stroma |
| `intratumoral_myeloid_lymphoid_tilt` | Log-ratio of myeloid to lymphoid cells within tumor |
| `stromal_inflammatory_tilt` | Log-ratio of inflammatory to fibroblast cells in stroma |
| `eosinophil_neutrophil_ratio_peritumoral` | Eosinophil-to-neutrophil ratio in peritumoral stroma |
| `perinecrotic_lymphocyte_enrichment` | Lymphocyte enrichment near necrosis |
| `perinecrotic_neutrophil_enrichment` | Neutrophil enrichment near necrosis |
| `perinecrotic_myeloid_tilt` | Myeloid-to-lymphoid tilt near necrosis |
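Two of the quantities named in this table, Shannon diversity and log-ratio "tilt", have standard definitions that can be sketched in a few lines. The exact variants used by HistoAtlas (logarithm base, pseudocount, which cell types count as "immune" or "myeloid") are not given in this card, so the natural log and the `pseudocount` parameter below are assumptions.

```python
from math import log


def shannon_diversity(counts):
    """Shannon entropy (natural log) of a list of per-type cell counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def log_ratio_tilt(numerator_count, denominator_count, pseudocount=1.0):
    """Log-ratio of two cell counts; the pseudocount avoids log(0)."""
    return log((numerator_count + pseudocount) / (denominator_count + pseudocount))
```

A tilt of 0 means the two populations are balanced, positive values favor the numerator population, and diversity is maximal when all cell types are equally frequent.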
### Spatial Heterogeneity (3 features)
| Feature | Description |
|---------|-------------|
| `lymphocyte_density_heterogeneity_tumor` | Coefficient of variation of lymphocyte density across tumor tiles |
| `tumor_cell_density_heterogeneity` | Coefficient of variation of cancer cell density across tumor tiles |
| `stromal_cellularity_heterogeneity` | Coefficient of variation of total cellularity across stroma tiles |
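The coefficient of variation behind these three features is the standard deviation of per-tile values divided by their mean. A minimal sketch follows; whether the atlas uses the population or the sample standard deviation is not stated here, so the use of `pstdev` is an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(tile_values):
    """CV of per-tile densities; requires at least one tile."""
    m = mean(tile_values)
    if m == 0:
        return float("nan")  # CV is undefined when the mean is zero
    return pstdev(tile_values) / m
```

Because it is normalized by the mean, the CV allows heterogeneity to be compared between slides with very different overall cell densities.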
### Additional Features
| Feature | Description |
|---------|-------------|
| `necrosis_in_tumor_fraction` | Fraction of necrosis within tumor regions |
| `necrosis_heterogeneity` | Spatial heterogeneity of necrosis distribution |
| `necrosis_rbc_enrichment` | RBC enrichment near necrosis (hemorrhagic necrosis) |
| `necrosis_contact_fraction_stroma` | Fraction of necrosis boundary in contact with stroma |
| `tumor_contact_fraction_stroma` | Fraction of tumor boundary in contact with stroma |
| `tumor_contact_fraction_necrosis` | Fraction of tumor boundary in contact with necrosis |
| `tumor_contact_fraction_normal` | Fraction of tumor boundary in contact with normal epithelium |
| `tumor_necrosis_proximity` | Mean distance from tumor to nearest necrosis |
| `artifact_fraction` | Fraction of tissue area flagged as artifact |
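Since every `*_value` column has a matching `*_status` column, a quick QC overview can be built by pairing them by name. The helper below is a hypothetical convenience function, not part of the dataset tooling; it relies only on the column naming scheme and the three status strings described at the start of this section.

```python
import pandas as pd


def qc_summary(df):
    """Count ok/warn/fail flags per feature from paired *_value/*_status columns."""
    names = [c[: -len("_value")] for c in df.columns if c.endswith("_value")]
    counts = {}
    for name in names:
        status_col = f"{name}_status"
        if status_col in df.columns:
            counts[name] = df[status_col].value_counts()
    summary = pd.DataFrame(counts).T
    # Make sure all three status columns exist, even if a status never occurs.
    return summary.reindex(columns=["ok", "warn", "fail"]).fillna(0).astype(int)
```

Sorting the result by the `fail` column is a quick way to see which features are most often unreliable in a given cohort.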
---
## Key Findings
<div align="center">
<img src="images/fig1_atlas_overview.png" alt="Atlas overview" width="800"/>
*HistoAtlas overview: (a) pipeline, (b) feature correlation structure, (c) UMAP embedding colored by cancer type, (d) morphological cluster composition, (e) cluster feature profiles.*
</div>
<br/>
<div align="center">
<img src="images/fig2_survival_correlations.png" alt="Survival and molecular correlations" width="650"/>
*Survival and molecular associations: (a) pan-cancer forest plot of immune density hazard ratios, (b) Kaplan-Meier curves in BRCA, (c-d) gene expression correlations.*
</div>
<br/>
<div align="center">
<img src="images/fig3_pathway_associations.png" alt="Pathway associations" width="650"/>
*Pathway-level associations: (a) heatmap of mean Spearman correlations between Hallmark pathways and histomic features, (b) distribution of significant associations across molecular data types.*
</div>
---
## Usage
```python
import pandas as pd

# Load the dataset (the hf:// path requires the `huggingface_hub` package)
df = pd.read_parquet("hf://datasets/PABannier/HistoAtlas/data.parquet")
print(f"Shape: {df.shape}")  # (6745, 106)
print(f"Cancer types: {df['cancer_type'].nunique()}")  # 21

# Get feature values for a specific cancer type
brca = df[df["cancer_type"] == "BRCA"]
print(f"BRCA slides: {len(brca)}")  # 1037

# Extract all value columns (quantitative measurements)
value_cols = [c for c in df.columns if c.endswith("_value")]
features = df[["cancer_type", "slide_name"] + value_cols]

# Check QC status for a feature
ok_mask = df["intratumoral_lymphocyte_density_status"] == "ok"
print(f"Slides with OK lymphocyte density: {ok_mask.sum()}")
```
### With Hugging Face `datasets`
```python
from datasets import load_dataset
dataset = load_dataset("PABannier/HistoAtlas")
df = dataset["data"].to_pandas()
```
### Linking to TCGA Clinical and Molecular Data
Slide names follow the TCGA barcode convention. Extract the case barcode (first 12 characters) to match with clinical, genomic, and transcriptomic data:
```python
# Extract TCGA case barcode for linking
df["case_id"] = df["slide_name"].str[:12]
# Now join with TCGA-CDR clinical data, MC3 mutations, RNA-seq, etc.
```
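The join itself might look like the sketch below, shown on toy data so it runs without downloads. The slide names, the `clinical` table, and its `os_event` / `os_days` columns are invented placeholders; real clinical tables use their own column names, which you would rename to `case_id` before merging.

```python
import pandas as pd

# Toy stand-in for the HistoAtlas table (slide names are invented).
histo = pd.DataFrame({
    "slide_name": ["TCGA-AA-0001-01Z-00-DX1", "TCGA-BB-0002-01Z-00-DX1"],
    "cancer_type": ["BRCA", "LUAD"],
})
histo["case_id"] = histo["slide_name"].str[:12]

# Toy stand-in for a clinical table keyed by case barcode (hypothetical columns).
clinical = pd.DataFrame({
    "case_id": ["TCGA-AA-0001", "TCGA-CC-0003"],
    "os_event": [1, 0],
    "os_days": [420, 1800],
})

# Left join keeps every slide; validate guards against duplicated barcodes.
merged = histo.merge(
    clinical, on="case_id", how="left", validate="one_to_one", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
print(f"Slides without clinical data: {len(unmatched)}")
```

Using `validate="one_to_one"` is appropriate here because the dataset keeps one slide per patient; for mutation or expression tables with several rows per case, a different `validate` setting or prior aggregation would be needed.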
---
## Preprocessing Notes
The values in this dataset are **raw (untransformed)** measurements. The statistical analyses in the paper applied the following preprocessing:
1. **Log-transform**: 22 features with heavy right-skew were transformed using log(1 + x)
2. **Winsorization**: All features clipped at the 0.5th and 99.5th percentiles (pan-cohort)
3. **Z-score standardization**: Zero mean, unit variance (scope varies by analysis)
The `*_status` columns encode per-slide QC flags:
- `"ok"`: Feature passed all quality checks
- `"warn"`: Minor quality concern (e.g., small compartment area)
- `"fail"`: Feature unreliable for this slide (e.g., insufficient tumor area)
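The three preprocessing steps and the QC flags can be combined as in the following sketch. It is an approximation under stated assumptions, not the paper's code: the list of 22 log-transformed features is not given in this card, so `log_features` must be supplied by the user; masking `"fail"` measurements before transforming is a choice made here; and the z-score is computed pan-cohort with a population standard deviation.

```python
import numpy as np
import pandas as pd


def preprocess(df, features, log_features=()):
    """Mask failed QC, then log1p (optional), winsorize, and z-score each feature."""
    out = pd.DataFrame(index=df.index)
    for name in features:
        x = df[f"{name}_value"].astype(float)
        # Assumption: measurements flagged "fail" are treated as missing.
        x = x.where(df[f"{name}_status"] != "fail")
        if name in log_features:
            x = np.log1p(x)  # log(1 + x) for right-skewed features
        lo, hi = x.quantile([0.005, 0.995])  # 0.5th and 99.5th percentiles
        x = x.clip(lower=lo, upper=hi)
        out[name] = (x - x.mean()) / x.std(ddof=0)
    return out
```

For per-cancer-type analyses, the same function can be applied within each `cancer_type` group so that the z-score scope matches the analysis.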
---
## Validation
Independent validation was performed on 1,095 CPTAC slides from 817 cases across five matched cancer types (BRCA, COAD, LUAD, LUSC, UCEC):
- 4/5 cancer-type matches passed feature-level concordance
- Prespecified histomic-transcriptomic associations replicated in direction (10/10) and significance (9/10)
- Matched mRNA-protein associations were directionally concordant in 98.8% of doubly-significant pairs
---
## Citation
If you use this dataset, please cite:
```bibtex
@article{bannier2025histoatlas,
title={HistoAtlas: a pan-cancer histomics atlas linking quantitative tissue morphology to transcriptomic programs, somatic alterations, and clinical outcomes},
author={Bannier, Pierre-Antoine},
journal={arXiv preprint arXiv:2603.16587},
year={2026},
url={https://arxiv.org/abs/2603.16587}
}
```
---
## Links
- **Interactive Web Atlas**: [histoatlas.com](https://histoatlas.com)
- **Paper**: [arXiv:2603.16587](https://arxiv.org/abs/2603.16587)
- **Code**: [github.com/histoatlas/histoatlas](https://github.com/histoatlas/histoatlas)
---
## License
This dataset is released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The underlying TCGA whole-slide images are governed by [GDC Data Use Policies](https://gdc.cancer.gov/access-data/data-access-policies).
Provider:
PABannier



