Histology images from uniform tumor regions in TCGA Whole Slide Images (TCGA-UT)
Source: NIAID Data Ecosystem. Indexed 2026-05-02.
Download link:
https://zenodo.org/record/3373438
Description:
TCGA-UT Dataset Documentation
Quick Links
Dataset on Hugging Face: To benchmark foundation models or feature extractors, use TCGA-UT on the Hugging Face Hub (dakomura/tcga-ut)
Original Paper: Universal encoding of pan-cancer histology by deep texture representations
Dataset Overview
The TCGA-UT dataset is a large-scale collection of histopathological image patches from human cancer tissues. It contains 1,608,060 image patches extracted from hematoxylin and eosin (H&E) stained histological samples across 32 types of solid cancer.
Key Features
Size: Over 1.6 million image patches
Resolution: All patches are standardized to 256 x 256 pixels
Source: Derived from The Cancer Genome Atlas (TCGA) dataset
Quality: Curated by trained pathologists
Coverage: 32 different cancer types
Patient Base: 8,736 diagnostic slides from 7,175 patients
Data Collection Process
Image Source: Whole Slide Images (WSI) were downloaded from the GDC legacy database between December 2016 and June 2017
Expert Annotation: Two trained pathologists selected at least three representative tumor regions per slide
Quality Control: 926 slides were removed due to various quality issues (poor staining, low resolution, focus problems, etc.)
Patch Extraction: 10 patches were randomly cropped at 6 different magnification levels from each annotated region
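The figures above can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the dataset tooling. It assumes that 10 patches were cropped at each of the 6 magnification levels (60 patches per region), and that the 8,736 slides are the ones remaining after quality control. Neither assumption is stated explicitly in the documentation.

```python
# Figures quoted in the documentation.
TOTAL_PATCHES = 1_608_060
SLIDES = 8_736
MIN_REGIONS_PER_SLIDE = 3
MAGNIFICATION_LEVELS = 6

# Assumption: 10 patches are cropped at EACH magnification level,
# so every annotated region contributes 6 * 10 = 60 patches.
PATCHES_PER_LEVEL = 10

patches_per_region = MAGNIFICATION_LEVELS * PATCHES_PER_LEVEL

# Lower bound on the patch count if every slide had exactly three regions.
min_expected_patches = SLIDES * MIN_REGIONS_PER_SLIDE * patches_per_region

# Number of annotated regions implied by the total patch count.
implied_regions, leftover = divmod(TOTAL_PATCHES, patches_per_region)

print(f"patches per region:      {patches_per_region}")
print(f"minimum expected total:  {min_expected_patches:,}")
print(f"implied regions:         {implied_regions:,} (leftover {leftover})")
print(f"mean regions per slide:  {implied_regions / SLIDES:.2f}")
```

Under these assumptions the total divides evenly into 60-patch regions and stays above the three-regions-per-slide minimum, so the quoted figures are mutually consistent.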
File Structure
Files are organized using the following format:
[cancer_type]/[resolution]/[TCGA Barcode]/[region]-[number]-[pixel resolution].jpg
Resolution Key
0: 0.5 μm/pixel
1: 0.6 μm/pixel
2: 0.7 μm/pixel
3: 0.8 μm/pixel
4: 0.9 μm/pixel
5: 1.0 μm/pixel
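The path format and resolution key can be combined into a small parser. The following is a minimal sketch, not code shipped with the dataset. The names `PatchInfo` and `parse_patch_path` are hypothetical, the example path is made up for illustration, and the code assumes that the number and pixel-resolution fields of the file name contain no hyphens.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Resolution key from the documentation: directory index -> micrometres per pixel.
MICRONS_PER_PIXEL = {0: 0.5, 1: 0.6, 2: 0.7, 3: 0.8, 4: 0.9, 5: 1.0}


@dataclass(frozen=True)
class PatchInfo:
    cancer_type: str
    resolution_index: int
    microns_per_pixel: float
    barcode: str
    region: str
    number: str
    pixel_resolution: str


def parse_patch_path(path: str) -> PatchInfo:
    """Split a path of the documented form into its fields (hypothetical helper)."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 path components: {path!r}")
    cancer_type, resolution, barcode, _filename = parts[-4:]

    index = int(resolution)
    if index not in MICRONS_PER_PIXEL:
        raise ValueError(f"unknown resolution index {index} in {path!r}")

    # Assumption: the last two hyphen-separated fields of the file name are
    # [number] and [pixel resolution]; everything before them is [region].
    fields = PurePosixPath(path).stem.rsplit("-", 2)
    if len(fields) != 3:
        raise ValueError(f"unexpected file name in {path!r}")
    region, number, pixel_resolution = fields

    return PatchInfo(cancer_type, index, MICRONS_PER_PIXEL[index],
                     barcode, region, number, pixel_resolution)


# Made-up path used only to exercise the parser.
info = parse_patch_path("Example_cancer/3/TCGA-XX-XXXX/1-7-256.jpg")
print(info.cancer_type, info.microns_per_pixel, info.barcode)
```

All file-name fields are kept as strings because the documentation does not specify their types.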
License
Non-Commercial Use: CC-BY-NC-SA 4.0
Commercial Use: Please contact ishum-prm@m.u-tokyo.ac.jp for licensing
Citation
If you use this dataset in your research, please cite:
Komura, D., et al. (2022). Universal encoding of pan-cancer histology by deep texture representations.
Cell Reports 38, 110424. https://doi.org/10.1016/j.celrep.2022.110424
For Model Benchmarking
If you're interested in using this dataset for benchmarking foundation models or feature extractors, we recommend accessing the dataset through the Hugging Face Hub at dakomura/tcga-ut. The Hugging Face version provides:
Predefined train/validation/test splits (both internal and external facility-based splits)
Ready-to-use benchmarking framework for foundation models
WebDataset format support for efficient data loading
Example implementations for state-of-the-art model evaluation
Created:
2025-02-07



