PABannier/HistoAtlas

Source: Hugging Face | Updated: 2026-03-19 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/PABannier/HistoAtlas
Resource description:
---
language:
- en
license: cc-by-nc-4.0
size_categories:
- 1K<n<10K
task_categories:
- tabular-classification
- tabular-regression
tags:
- biology
- medical
- cancer
- pathology
- histology
- computational-pathology
- digital-pathology
- TCGA
- pan-cancer
- morphometrics
- cell-segmentation
- survival-analysis
- biomarkers
- oncology
pretty_name: "HistoAtlas: Pan-Cancer Histomics from TCGA"
configs:
- config_name: default
  data_files:
  - split: data
    path: data.parquet
---

<div align="center">

# HistoAtlas: Pan-Cancer Quantitative Histomics

### 38 interpretable morphometric features from 6,745 TCGA diagnostic H&E slides across 21 solid-tumor cancer types

[![arXiv](https://img.shields.io/badge/arXiv-2603.16587-b31b1b.svg)](https://arxiv.org/abs/2603.16587)
[![Website](https://img.shields.io/badge/Web_Atlas-histoatlas.com-blue.svg)](https://histoatlas.com)
[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-histoatlas-181717.svg?logo=github)](https://github.com/histoatlas/histoatlas)

<img src="images/pipeline_overview.png" alt="HistoAtlas Pipeline" width="800"/>

*From raw H&E whole-slide images to quantitative histomics: tissue and cell segmentation, compartment-resolved feature extraction, pan-cancer statistical analysis, and an interactive web atlas.*

</div>

---

## Overview

**HistoAtlas** is a pan-cancer computational histopathology atlas that quantifies tumor morphology from routine H&E-stained diagnostic whole-slide images. This dataset contains **38 interpretable, compartment-resolved histomic features** extracted from **6,745 TCGA slides** spanning **21 solid-tumor cancer types**, representing 6,745 unique patients.
Each feature captures a specific, biologically interpretable aspect of tumor architecture: tissue composition, cell densities, nuclear morphology, spatial organization of immune and stromal cells, and intra-tumoral heterogeneity. Features are computed from automated tissue and cell segmentation at single-cell resolution, then aggregated at the slide level.

The full precomputed statistical analysis (survival associations, molecular correlations, mutation associations, morphological clusters) is available through the [interactive web atlas](https://histoatlas.com) and the companion [arXiv paper](https://arxiv.org/abs/2603.16587).

---

## Dataset Description

### Source Data

Formalin-fixed, paraffin-embedded (FFPE) H&E-stained diagnostic whole-slide images were obtained from [The Cancer Genome Atlas (TCGA)](https://portal.gdc.cancer.gov/) via the Genomic Data Commons (GDC) portal. One slide per patient was retained (primary tumor diagnostic slide with the largest tissue area), yielding 6,745 slides across 6,745 unique patients.

**Inclusion criteria:**

- Viable tissue area above 1 mm^2
- No severe processing artifacts (pen marks covering >20% of tissue, out-of-focus regions)
- Essential clinical metadata available (vital status, follow-up time)

**Twelve additional TCGA cancer types were excluded** because their dominant cell morphologies fall outside the training domain of the cell detection model.

### Segmentation Pipeline

Feature extraction used a two-stage segmentation pipeline:

1. **Tissue segmentation**: A CellViT-inspired architecture ([Horst et al., 2024](https://doi.org/10.1016/j.media.2024.103143)) with a [Phikon](https://arxiv.org/abs/2310.07033) self-supervised ViT-B backbone, trained on the [PanopTILs](https://doi.org/10.1016/j.media.2024.103191) crowdsourced annotation dataset. Inference at 0.5 um/px on 224x224 pixel tiles classified each region into five effective tissue compartments: cancerous epithelium, stroma, necrosis, normal epithelium, and blood.
2. **Cell segmentation and classification**: The [HistoPLUS model](https://arxiv.org/abs/2508.09926) detected and classified individual cells into nine morphological types: tumor cells, lymphocytes, fibroblasts, plasmocytes, neutrophils, eosinophils, red blood cells, apoptotic bodies, and mitotic figures. Inference at 40x magnification (0.25 um/px) with overlap deduplication via a union-find algorithm.

### Spatial Feature Computation

All spatial features were computed on compartment masks resampled to 8 um/px. Five spatial bands were defined using the signed Euclidean distance transform from the tumor boundary:

| Band | Definition | Distance |
|------|-----------|----------|
| Tumor front | Outer rim of tumor | 0-50 um inside tumor boundary |
| Tumor core | Deep tumor interior | >50 um inside |
| Peritumoral stroma (near) | Stroma adjacent to tumor | 0-50 um outside |
| Peritumoral stroma (far) | Distant stroma | 50-200 um outside |
| Necrosis ring | Perinecrotic zone | 0-100 um from necrosis |

<div align="center">

<img src="images/fig6_spatial_explainability.png" alt="Spatial explainability" width="700"/>

*Spatial interpretability: tissue compartment maps, top-scoring tiles with cell-type overlays, and the link between spatial features and survival associations.*

</div>

---

## Cancer Types

| Code | Cancer Type | N slides |
|------|------------|----------|
| BRCA | Breast invasive carcinoma | 1,037 |
| LUAD | Lung adenocarcinoma | 511 |
| THCA | Thyroid carcinoma | 473 |
| HNSC | Head and neck squamous cell carcinoma | 471 |
| UCEC | Uterine corpus endometrial carcinoma | 459 |
| COAD | Colon adenocarcinoma | 441 |
| BLCA | Bladder urothelial carcinoma | 417 |
| STAD | Stomach adenocarcinoma | 400 |
| LIHC | Liver hepatocellular carcinoma | 365 |
| LUSC | Lung squamous cell carcinoma | 357 |
| PRAD | Prostate adenocarcinoma | 353 |
| CESC | Cervical squamous cell carcinoma | 279 |
| ACC | Adrenocortical carcinoma | 227 |
| THYM | Thymoma | 180 |
| ESCA | Esophageal carcinoma | 158 |
| READ | Rectum adenocarcinoma | 157 |
| PAAD | Pancreatic adenocarcinoma | 146 |
| OV | Ovarian serous cystadenocarcinoma | 107 |
| UCS | Uterine carcinosarcoma | 87 |
| MESO | Mesothelioma | 82 |
| CHOL | Cholangiocarcinoma | 38 |
| **Total** | | **6,745** |

---

## Features

Each slide has 51 `*_value` columns (quantitative measurements) and 51 matching `*_status` columns (QC flags: `"ok"`, `"warn"`, or `"fail"`). The 38 core histomic features are organized into five categories:

### Tissue Composition (7 features)

| Feature | Description |
|---------|-------------|
| `tumor_area_fraction` | Fraction of tissue area classified as cancerous epithelium |
| `stroma_area_fraction` | Fraction of tissue area classified as stroma |
| `normal_epithelium_area_fraction` | Fraction classified as normal epithelium |
| `tumor_front_fraction` | Proportion of tumor area within the 0-50 um front band |
| `largest_tumor_component_share` | Share of total tumor area in the largest connected component |
| `tumor_region_solidity` | Convex hull solidity of the tumor mask |
| `tissue_coverage` | Fraction of the slide covered by tissue |

### Cell Densities (6 features)

| Feature | Description |
|---------|-------------|
| `intratumoral_lymphocyte_density` | Lymphocyte count per mm^2 within tumor epithelium |
| `stromal_lymphocyte_density` | Lymphocyte count per mm^2 within stroma |
| `intratumoral_cancer_cell_density` | Cancer cell count per mm^2 within tumor |
| `fibroblast_density_stroma` | Fibroblast count per mm^2 within stroma |
| `intratumoral_eosinophil_density` | Eosinophil count per mm^2 within tumor |
| `intratumoral_neutrophil_density` | Neutrophil count per mm^2 within tumor |

### Nuclear Morphology and Kinetics (8 features)

| Feature | Description |
|---------|-------------|
| `tumor_nuclear_area_median` | Median nuclear area of tumor cells (um^2) |
| `tumor_nuclear_eccentricity_median` | Median nuclear eccentricity (0 = circle, 1 = line) |
| `tumor_nuclear_irregularity_median` | Median nuclear contour irregularity |
| `tumor_nuclear_irregularity_iqr` | IQR of nuclear irregularity (morphological heterogeneity) |
| `tumor_pleomorphism_index` | Composite nuclear pleomorphism score |
| `mitotic_index_tumor` | Mitotic figure count per mm^2 of tumor |
| `apoptotic_index_tumor` | Apoptotic body count per mm^2 of tumor |
| `apoptosis_mitosis_ratio_tumor` | Ratio of apoptotic to mitotic events |

### Spatial Organization (18 features)

| Feature | Description |
|---------|-------------|
| `lymphocyte_infiltration_ratio_front` | Lymphocyte density ratio: tumor front vs. core |
| `myeloid_infiltration_ratio_front` | Myeloid cell density ratio: tumor front vs. core |
| `tumor_lymphocyte_nn_distance_front` | Mean nearest-neighbor distance from tumor cells to lymphocytes at the front |
| `tumor_fibroblast_coupling_front` | Fibroblast density ratio at the tumor front |
| `tumor_stroma_interface_density` | Length of tumor-stroma boundary per unit tumor area |
| `interface_normalized_immune_pressure` | Immune cell density normalized by interface length |
| `invasion_depth_p75` | 75th percentile of tumor invasion depth (um) |
| `peritumoral_immune_richness` | Shannon diversity of immune cell types in peritumoral stroma |
| `peritumoral_fibroblast_enrichment` | Fibroblast enrichment in peritumoral vs. distal stroma |
| `immune_desert_fraction` | Fraction of tumor area devoid of immune cells |
| `deep_intratumoral_lymphocyte_fraction` | Fraction of intratumoral lymphocytes in tumor core (>50 um) |
| `fibroblast_lymphocyte_proximity_stroma` | Mean distance between fibroblasts and lymphocytes in stroma |
| `intratumoral_myeloid_lymphoid_tilt` | Log-ratio of myeloid to lymphoid cells within tumor |
| `stromal_inflammatory_tilt` | Log-ratio of inflammatory to fibroblast cells in stroma |
| `eosinophil_neutrophil_ratio_peritumoral` | Eosinophil-to-neutrophil ratio in peritumoral stroma |
| `perinecrotic_lymphocyte_enrichment` | Lymphocyte enrichment near necrosis |
| `perinecrotic_neutrophil_enrichment` | Neutrophil enrichment near necrosis |
| `perinecrotic_myeloid_tilt` | Myeloid-to-lymphoid tilt near necrosis |

### Spatial Heterogeneity (3 features)

| Feature | Description |
|---------|-------------|
| `lymphocyte_density_heterogeneity_tumor` | Coefficient of variation of lymphocyte density across tumor tiles |
| `tumor_cell_density_heterogeneity` | Coefficient of variation of cancer cell density across tumor tiles |
| `stromal_cellularity_heterogeneity` | Coefficient of variation of total cellularity across stroma tiles |

### Additional Features

| Feature | Description |
|---------|-------------|
| `necrosis_in_tumor_fraction` | Fraction of necrosis within tumor regions |
| `necrosis_heterogeneity` | Spatial heterogeneity of necrosis distribution |
| `necrosis_rbc_enrichment` | RBC enrichment near necrosis (hemorrhagic necrosis) |
| `necrosis_contact_fraction_stroma` | Fraction of necrosis boundary in contact with stroma |
| `tumor_contact_fraction_stroma` | Fraction of tumor boundary in contact with stroma |
| `tumor_contact_fraction_necrosis` | Fraction of tumor boundary in contact with necrosis |
| `tumor_contact_fraction_normal` | Fraction of tumor boundary in contact with normal epithelium |
| `tumor_necrosis_proximity` | Mean distance from tumor to nearest necrosis |
| `artifact_fraction` | Fraction of tissue area flagged as artifact |

---

## Key Findings

<div align="center">

<img src="images/fig1_atlas_overview.png" alt="Atlas overview" width="800"/>

*HistoAtlas overview: (a) pipeline, (b) feature correlation structure, (c) UMAP embedding colored by cancer type, (d) morphological cluster composition, (e) cluster feature profiles.*

</div>

<br/>

<div align="center">

<img src="images/fig2_survival_correlations.png" alt="Survival and molecular correlations" width="650"/>

*Survival and molecular associations: (a) pan-cancer forest plot of immune density hazard ratios, (b) Kaplan-Meier curves in BRCA, (c-d) gene expression correlations.*

</div>

<br/>

<div align="center">

<img src="images/fig3_pathway_associations.png" alt="Pathway associations" width="650"/>

*Pathway-level associations: (a) heatmap of mean Spearman correlations between Hallmark pathways and histomic features, (b) distribution of significant associations across molecular data types.*

</div>

---

## Usage

```python
import pandas as pd

# Load the dataset
df = pd.read_parquet("hf://datasets/PABannier/HistoAtlas/data.parquet")
print(f"Shape: {df.shape}")  # (6745, 106)
print(f"Cancer types: {df['cancer_type'].nunique()}")  # 21

# Get feature values for a specific cancer type
brca = df[df["cancer_type"] == "BRCA"]
print(f"BRCA slides: {len(brca)}")  # 1037

# Extract all value columns (quantitative measurements)
value_cols = [c for c in df.columns if c.endswith("_value")]
features = df[["cancer_type", "slide_name"] + value_cols]

# Check QC status for a feature
ok_mask = df["intratumoral_lymphocyte_density_status"] == "ok"
print(f"Slides with OK lymphocyte density: {ok_mask.sum()}")
```

### With Hugging Face `datasets`

```python
from datasets import load_dataset

dataset = load_dataset("PABannier/HistoAtlas")
df = dataset["data"].to_pandas()
```

### Linking to TCGA Clinical and Molecular Data

Slide names follow the TCGA barcode convention.
Extract the case barcode (first 12 characters) to match with clinical, genomic, and transcriptomic data:

```python
# Extract TCGA case barcode for linking
df["case_id"] = df["slide_name"].str[:12]

# Now join with TCGA-CDR clinical data, MC3 mutations, RNA-seq, etc.
```

---

## Preprocessing Notes

The values in this dataset are **raw (untransformed)** measurements. The statistical analyses in the paper applied the following preprocessing:

1. **Log-transform**: 22 features with heavy right-skew were transformed using log(1 + x)
2. **Winsorization**: All features clipped at the 0.5th and 99.5th percentiles (pan-cohort)
3. **Z-score standardization**: Zero mean, unit variance (scope varies by analysis)

The `*_status` columns encode per-slide QC flags:

- `"ok"`: Feature passed all quality checks
- `"warn"`: Minor quality concern (e.g., small compartment area)
- `"fail"`: Feature unreliable for this slide (e.g., insufficient tumor area)

---

## Validation

Independent validation was performed on 1,095 CPTAC slides from 817 cases across five matched cancer types (BRCA, COAD, LUAD, LUSC, UCEC):

- 4/5 cancer-type matches passed feature-level concordance
- Prespecified histomic-transcriptomic associations replicated in direction (10/10) and significance (9/10)
- Matched mRNA-protein associations were directionally concordant in 98.8% of doubly-significant pairs

---

## Citation

If you use this dataset, please cite:

```bibtex
@article{bannier2025histoatlas,
  title={HistoAtlas: a pan-cancer histomics atlas linking quantitative tissue morphology to transcriptomic programs, somatic alterations, and clinical outcomes},
  author={Bannier, Pierre-Antoine},
  journal={arXiv preprint arXiv:2603.16587},
  year={2025},
  url={https://arxiv.org/abs/2603.16587}
}
```

---

## Links

- **Interactive Web Atlas**: [histoatlas.com](https://histoatlas.com)
- **Paper**: [arXiv:2603.16587](https://arxiv.org/abs/2603.16587)
- **Code**: [github.com/histoatlas/histoatlas](https://github.com/histoatlas/histoatlas)
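
---

## Preprocessing Sketch

The three steps listed under Preprocessing Notes (log-transform, winsorization, z-scoring) can be sketched for a single feature as follows. This is a minimal, dependency-free illustration and not the paper's code. The function names `percentile` and `preprocess`, the linear percentile interpolation, the use of the population standard deviation, and the choice to drop every value whose QC flag is not `"ok"` are all assumptions. The dataset card does not list which 22 features are log-transformed, so `log_transform` is left to the caller.

```python
import math
import statistics


def percentile(ordered, q):
    """Linearly interpolated q-th percentile (0-100) of an ascending list."""
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def preprocess(values, statuses, log_transform=True):
    """Return one z-score per slide, with None where the value was excluded."""
    # Assumption: only "ok" values enter the statistics; None/NaN are skipped.
    keep = [
        s == "ok" and v is not None and not math.isnan(v)
        for v, s in zip(values, statuses)
    ]
    x = [float(v) for v, k in zip(values, keep) if k]
    if not x:
        raise ValueError("no usable values")
    # Step 1: log(1 + x) for right-skewed features.
    if log_transform:
        x = [math.log1p(v) for v in x]
    # Step 2: winsorize at the 0.5th and 99.5th percentiles.
    ordered = sorted(x)
    low, high = percentile(ordered, 0.5), percentile(ordered, 99.5)
    x = [min(max(v, low), high) for v in x]
    # Step 3: z-score (population standard deviation is an assumption).
    mean, std = statistics.fmean(x), statistics.pstdev(x)
    scores = iter([(v - mean) / std if std > 0 else 0.0 for v in x])
    # Re-align the scores with the input order, leaving gaps as None.
    return [next(scores) if k else None for k in keep]


# Toy input: five slides, one of which carries a "warn" flag.
z = preprocess([0.0, 1.0, 3.0, 7.0, 100.0], ["ok", "ok", "warn", "ok", "ok"])
```

With a pandas DataFrame, the two input lists would be the `<feature>_value` and `<feature>_status` columns of one feature. Because the card states that the standardization scope varies by analysis, the set of slides passed in (pan-cohort or a single cancer type) is a choice the caller has to make.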
---

## License

This dataset is released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The underlying TCGA whole-slide images are governed by [GDC Data Use Policies](https://gdc.cancer.gov/access-data/data-access-policies).
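
---

## Spatial Band Sketch

The band table under Spatial Feature Computation can be read as a lookup from signed distance to band name. The sketch below illustrates that lookup under stated assumptions and is not the pipeline's implementation. The names `spatial_band` and `in_necrosis_ring` are hypothetical, the sign convention (negative inside the tumor, positive outside) is assumed, and so is the handling of values exactly on a band edge. It also omits the compartment check: a pixel outside the tumor belongs to a peritumoral stroma band only if it is stroma.

```python
PIXEL_SIZE_UM = 8.0  # compartment masks are resampled to 8 um/px


def spatial_band(signed_distance_px):
    """Map a signed distance to the tumor boundary (in mask pixels) to a band.

    Assumed convention: negative = inside the tumor, positive = outside.
    """
    d_um = signed_distance_px * PIXEL_SIZE_UM
    if d_um < -50.0:
        return "tumor_core"               # >50 um inside
    if d_um < 0.0:
        return "tumor_front"              # 0-50 um inside
    if d_um <= 50.0:
        return "peritumoral_stroma_near"  # 0-50 um outside
    if d_um <= 200.0:
        return "peritumoral_stroma_far"   # 50-200 um outside
    return None                           # beyond every tumor-relative band


def in_necrosis_ring(distance_to_necrosis_px):
    """True when a pixel lies within 0-100 um of necrosis (unsigned distance)."""
    d_um = distance_to_necrosis_px * PIXEL_SIZE_UM
    return 0.0 <= d_um <= 100.0


# Example: a pixel three mask pixels (24 um) inside the tumor boundary.
band = spatial_band(-3)
```

The necrosis ring is kept as a separate predicate because the table measures it from necrosis, not from the tumor boundary, so a pixel can belong to a tumor-relative band and to the necrosis ring at the same time.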
Provider:
PABannier