vaaaasss/tcga-ut
Hugging Face · Updated 2026-03-04 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/vaaaasss/tcga-ut
Resource description:
---
license: cc-by-nc-sa-4.0
dataset_info:
  features:
  - name: __key__
    dtype: string
  - name: __url__
    dtype: string
  - name: jpg
    dtype: image
  - name: json
    struct:
    - name: label
      dtype: string
  splits:
  - name: internal_train
    num_bytes: 4231024640
    num_examples: 190080
  - name: internal_valid
    num_bytes: 910479360
    num_examples: 40770
  - name: internal_test
    num_bytes: 911656960
    num_examples: 40860
  - name: external_train
    num_bytes: 4291031040
    num_examples: 192680
  - name: external_valid
    num_bytes: 893061120
    num_examples: 39670
  - name: external_test
    num_bytes: 869038080
    num_examples: 39360
  download_size: 12106291200
configs:
- config_name: internal
  data_files:
  - split: train
    path: data/dataset_internal_train_part*.tar
  - split: valid
    path: data/dataset_internal_valid_part*.tar
  - split: test
    path: data/dataset_internal_test_part*.tar
  default: true
- config_name: external
  data_files:
  - split: train
    path: data/dataset_external_train_part*.tar
  - split: valid
    path: data/dataset_external_valid_part*.tar
  - split: test
    path: data/dataset_external_test_part*.tar
tags:
- histology
- pathology
- webdataset
- image
task_categories:
- image-feature-extraction
- image-classification
---
# Histology images from uniform tumor regions in TCGA Whole Slide Images (TCGA-UT-Internal, TCGA-UT-External)
<div style="text-align: center;">
<img src="logo.webp" width="600" alt="TCGA Histology Dataset Logo">
</div>
This repository provides a benchmarking framework for the TCGA histology image dataset originally published on [Zenodo](https://zenodo.org/records/5889558). It includes predefined train/validation/test splits and example code for foundation model evaluation.
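As a rough sketch, the splits might be loaded with the Hugging Face `datasets` library as shown here. The config names (`internal`, `external`), split names (`train`, `valid`, `test`), and field names (`jpg`, `json`) come from the dataset card metadata. The repository id `vaaaasss/tcga-ut` and the helper name `load_tcga_ut` are assumptions made for illustration, and access may require logging in first.

```python
# Config and split names as declared in the dataset card metadata.
CONFIGS = ("internal", "external")
SPLITS = ("train", "valid", "test")


def load_tcga_ut(config="internal", split="train", repo_id="vaaaasss/tcga-ut"):
    """Stream one split (hypothetical helper; repo_id is an assumption)."""
    if config not in CONFIGS:
        raise ValueError(f"config must be one of {CONFIGS}, got {config!r}")
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    # Imported lazily so the argument checks work without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over shards without downloading everything first.
    return load_dataset(repo_id, config, split=split, streaming=True)
```

Under these assumptions, `next(iter(load_tcga_ut("internal", "valid")))` should yield a dict whose `jpg` field is the decoded image and whose `json` field holds the `label` string.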
## Task
Classification of 31 different cancer types from tumor histopathological images.
## Original Dataset Description
This dataset contains 1,608,060 image patches of hematoxylin & eosin stained histological samples from various human cancers. The data was collected and processed as follows:
- Source: TCGA dataset from 32 solid cancer types (GDC legacy database, downloaded between December 1, 2016, and June 19, 2017)
- Initial data: 9,662 diagnostic slides from 7,951 patients in SVS format
- Annotation: At least three representative tumor regions were selected as polygons by two trained pathologists
- Quality control: 926 slides were removed due to poor staining, low resolution, out-of-focus issues, absence of cancerous regions, or incorrect cancer types
- Final dataset: 8,736 diagnostic slides from 7,175 patients
- Patch extraction: 10 patches at 0.5 μm/pixel resolution (128 x 128 μm) were randomly cropped from each annotated region
Note: Additional resolution levels are available in the original Zenodo dataset. Please refer to the Zenodo repository for the complete dataset.
The first three fields of a TCGA barcode (TCGA-XX-XXXX) identify the patient. For details, see the [TCGA Barcode documentation](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/).
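The barcode convention can be illustrated with a small parser. This is a minimal sketch: `parse_patient` is a hypothetical helper, and it assumes the string you pass (for example a slide or file name) starts with a standard TCGA barcode. Per the TCGA barcode documentation, the second field is the tissue source site code.

```python
import re

# Matches the leading "TCGA-XX-XXXX" part of a TCGA barcode.
_PATIENT_RE = re.compile(r"^(TCGA-([A-Z0-9]{2})-([A-Z0-9]{4}))")


def parse_patient(barcode):
    """Return (patient_id, tissue_source_site) from a TCGA barcode string."""
    match = _PATIENT_RE.match(barcode.upper())
    if match is None:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return match.group(1), match.group(2)


print(parse_patient("TCGA-02-0001-01C-01D-0182-01"))
```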
## Updates in This Version
The dataset has been modified and organized for benchmarking purposes:
1. **Label Consolidation**:
- Colon Adenocarcinoma (COAD) and Rectum Adenocarcinoma (READ) have been merged due to their histological similarity
2. **Structured Splits**:
### Internal Split (70:15:15): TCGA-UT-Internal
- Ensures no patient overlap between train, validation, and test sets
- Approximate distribution: 70% train, 15% validation, 15% test
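The patient-level constraint can be sketched as follows. This is not the authors' splitting code, only an illustration of the idea: patients, not patches, are assigned to splits, so every patch from one patient lands in the same split. The function name, the seed, and the rounding rule are assumptions, and the sketch does not balance the ratio within each cancer type.

```python
import random


def split_by_patient(patient_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign each unique patient to train/valid/test; patches follow their patient."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    n_train = round(len(patients) * ratios[0])
    n_valid = round(len(patients) * ratios[1])
    return {
        "train": set(patients[:n_train]),
        "valid": set(patients[n_train:n_train + n_valid]),
        "test": set(patients[n_train + n_valid:]),
    }


# Three patches per patient for 100 made-up patient IDs.
ids = [f"TCGA-00-{i:04d}" for i in range(100)] * 3
groups = split_by_patient(ids)
print({name: len(members) for name, members in groups.items()})
```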
### External Split: TCGA-UT-External
- Separates data based on medical facilities to evaluate cross-institutional generalization
- No facility overlap between train, validation, and test sets
- Maintains similar class distributions across splits
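A facility-level split can be checked for leakage with a few lines of Python. The sketch below assumes that "medical facility" corresponds to the tissue source site code (the second field of the TCGA barcode). That mapping is an assumption made for illustration, and `facility_of` and `shared_facilities` are hypothetical helpers, not part of this repository.

```python
def facility_of(barcode):
    """Tissue source site code: the second dash-separated field of a TCGA barcode."""
    return barcode.split("-")[1]


def shared_facilities(splits):
    """Return site codes that appear in more than one split (empty set = no overlap)."""
    seen = {}
    shared = set()
    for name, barcodes in splits.items():
        for site in {facility_of(b) for b in barcodes}:
            if site in seen and seen[site] != name:
                shared.add(site)
            seen.setdefault(site, name)
    return shared


leaky = {
    "train": ["TCGA-02-0001", "TCGA-02-0003"],
    "valid": ["TCGA-06-0001"],
    "test": ["TCGA-02-0007"],
}
print(shared_facilities(leaky))
```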
## Dataset Details
### Internal Split: TCGA-UT-Internal
| Cancer type | train (patches) | valid (patches) | test (patches) | train (patients) | valid (patients) | test (patients) |
|:-------------------------------------------------------------|----------------:|----------------:|---------------:|-----------------:|-----------------:|----------------:|
| Adrenocortical_carcinoma | 3480 | 750 | 750 | 35 | 8 | 8 |
| Bladder_Urothelial_Carcinoma | 6990 | 1500 | 1500 | 202 | 43 | 44 |
| Brain_Lower_Grade_Glioma | 16480 | 3530 | 3520 | 326 | 70 | 71 |
| Breast_invasive_carcinoma | 16580 | 3550 | 3560 | 513 | 110 | 111 |
| Cervical_squamous_cell_carcinoma_and_endocervical_adenocarcinoma | 4380 | 930 | 960 | 140 | 30 | 31 |
| Cholangiocarcinoma | 630 | 120 | 150 | 21 | 4 | 5 |
| Colon_Rectum_adenocarcinoma | 7020 | 1510 | 1500 | 190 | 41 | 41 |
| Esophageal_carcinoma | 2360 | 510 | 510 | 78 | 17 | 17 |
| Glioblastoma_multiforme | 16620 | 3570 | 3550 | 254 | 54 | 55 |
| Head_and_Neck_squamous_cell_carcinoma | 8250 | 1770 | 1770 | 221 | 48 | 48 |
| Kidney_Chromophobe | 1710 | 360 | 390 | 57 | 12 | 13 |
| Kidney_renal_clear_cell_carcinoma | 8160 | 1740 | 1750 | 269 | 58 | 58 |
| Kidney_renal_papillary_cell_carcinoma | 4750 | 1020 | 1020 | 149 | 32 | 33 |
| Liver_hepatocellular_carcinoma | 5860 | 1250 | 1260 | 190 | 41 | 41 |
| Lung_adenocarcinoma | 11520 | 2470 | 2470 | 303 | 65 | 66 |
| Lung_squamous_cell_carcinoma | 11590 | 2490 | 2480 | 305 | 66 | 66 |
| Lymphoid_Neoplasm_Diffuse_Large_B-cell_Lymphoma | 570 | 120 | 150 | 19 | 4 | 5 |
| Mesothelioma | 1470 | 320 | 300 | 42 | 9 | 10 |
| Ovarian_serous_cystadenocarcinoma | 1740 | 390 | 390 | 58 | 13 | 13 |
| Pancreatic_adenocarcinoma | 2850 | 620 | 620 | 88 | 19 | 19 |
| Pheochromocytoma_and_Paraganglioma | 930 | 210 | 210 | 30 | 7 | 7 |
| Prostate_adenocarcinoma | 6870 | 1470 | 1470 | 212 | 45 | 46 |
| Sarcoma | 9440 | 2010 | 2030 | 149 | 32 | 32 |
| Skin_Cutaneous_Melanoma | 7040 | 1510 | 1510 | 226 | 48 | 49 |
| Stomach_adenocarcinoma | 6770 | 1450 | 1450 | 182 | 39 | 39 |
| Testicular_Germ_Cell_Tumors | 4210 | 900 | 900 | 92 | 20 | 20 |
| Thymoma | 2520 | 540 | 540 | 59 | 13 | 13 |
| Thyroid_carcinoma | 7950 | 1710 | 1700 | 259 | 56 | 56 |
| Uterine_Carcinosarcoma | 1470 | 320 | 330 | 34 | 7 | 8 |
| Uterine_Corpus_Endometrial_Carcinoma | 8730 | 1890 | 1860 | 266 | 57 | 58 |
| Uveal_Melanoma | 1140 | 240 | 260 | 38 | 8 | 9 |
| **Total** | 190080 | 40770 | 40860 | 5007 | 1076 | 1092 |
### External Split: TCGA-UT-External
| Cancer type | train (patches) | valid (patches) | test (patches) | train (patients) | valid (patients) | test (patients) |
|:-------------------------------------------------------------|----------------:|----------------:|---------------:|-----------------:|-----------------:|----------------:|
| Adrenocortical_carcinoma | 4500 | 390 | 90 | 45 | 5 | 1 |
| Bladder_Urothelial_Carcinoma | 6990 | 1500 | 1500 | 190 | 50 | 49 |
| Brain_Lower_Grade_Glioma | 16430 | 3540 | 3560 | 332 | 80 | 55 |
| Breast_invasive_carcinoma | 16560 | 3570 | 3560 | 509 | 116 | 109 |
| Cervical_squamous_cell_carcinoma_and_endocervical_adenocarcinoma | 4380 | 930 | 960 | 145 | 31 | 25 |
| Cholangiocarcinoma | 660 | 150 | 90 | 22 | 5 | 3 |
| Colon_Rectum_adenocarcinoma | 7020 | 1500 | 1510 | 197 | 39 | 36 |
| Esophageal_carcinoma | 2360 | 510 | 510 | 78 | 17 | 17 |
| Glioblastoma_multiforme | 16630 | 3810 | 3300 | 244 | 76 | 43 |
| Head_and_Neck_squamous_cell_carcinoma | 8260 | 1750 | 1780 | 224 | 51 | 42 |
| Kidney_Chromophobe | 1740 | 270 | 450 | 58 | 9 | 15 |
| Kidney_renal_clear_cell_carcinoma | 8170 | 1710 | 1770 | 269 | 57 | 59 |
| Kidney_renal_papillary_cell_carcinoma | 4750 | 1020 | 1020 | 146 | 34 | 34 |
| Liver_hepatocellular_carcinoma | 5870 | 1300 | 1200 | 189 | 43 | 40 |
| Lung_adenocarcinoma | 11530 | 2470 | 2460 | 288 | 77 | 69 |
| Lung_squamous_cell_carcinoma | 11580 | 2490 | 2490 | 296 | 68 | 73 |
| Lymphoid_Neoplasm_Diffuse_Large_B-cell_Lymphoma | 600 | 90 | 150 | 20 | 3 | 5 |
| Mesothelioma | 1470 | 300 | 320 | 43 | 10 | 8 |
| Ovarian_serous_cystadenocarcinoma | 2220 | 120 | 180 | 74 | 4 | 6 |
| Pancreatic_adenocarcinoma | 2860 | 600 | 630 | 85 | 20 | 21 |
| Pheochromocytoma_and_Paraganglioma | 1170 | 90 | 90 | 38 | 3 | 3 |
| Prostate_adenocarcinoma | 6870 | 1470 | 1470 | 226 | 49 | 28 |
| Sarcoma | 9490 | 2070 | 1920 | 154 | 28 | 31 |
| Skin_Cutaneous_Melanoma | 7030 | 1530 | 1500 | 233 | 40 | 50 |
| Stomach_adenocarcinoma | 6990 | 1330 | 1350 | 187 | 37 | 36 |
| Testicular_Germ_Cell_Tumors | 4600 | 630 | 780 | 96 | 10 | 26 |
| Thymoma | 2520 | 540 | 540 | 54 | 18 | 13 |
| Thyroid_carcinoma | 7980 | 1650 | 1730 | 259 | 54 | 58 |
| Uterine_Carcinosarcoma | 1470 | 330 | 320 | 37 | 7 | 5 |
| Uterine_Corpus_Endometrial_Carcinoma | 8730 | 1890 | 1860 | 272 | 48 | 61 |
| Uveal_Melanoma | 1250 | 120 | 270 | 42 | 4 | 9 |
| **Total** | 192680 | 39670 | 39360 | 5052 | 1093 | 1030 |
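As a quick sanity check, the split proportions can be recomputed from the **Total** rows of the two tables above. The numbers are copied from the tables; the helper name `split_fractions` is only illustrative.

```python
# Patch totals copied from the "Total" rows of the two tables.
TOTALS = {
    "internal": {"train": 190080, "valid": 40770, "test": 40860},
    "external": {"train": 192680, "valid": 39670, "test": 39360},
}


def split_fractions(counts):
    """Fraction of patches in each split, rounded to three decimals."""
    total = sum(counts.values())
    return {name: round(n / total, 3) for name, n in counts.items()}


for config, counts in TOTALS.items():
    print(config, sum(counts.values()), split_fractions(counts))
```

Both configurations cover the same 271,710 patches. The internal split comes out at roughly 70:15:15, while the external split is close to 71:15:14 because whole facilities are assigned to a single split.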
## Foundation Model Benchmarking
We provide example implementations for the following foundation models, plus a ResNet baseline:
- [CONCH](https://huggingface.co/MahmoodLab/CONCH)
- [GigaPath](https://huggingface.co/prov-gigapath/prov-gigapath)
- [UNI](https://huggingface.co/MahmoodLab/UNI)
- [UNI2](https://huggingface.co/MahmoodLab/UNI2-h)
- [H-Optimus-0](https://huggingface.co/bioptimus/H-optimus-0)
- [H-Optimus-1](https://huggingface.co/bioptimus/H-optimus-1)
- [Virchow](https://huggingface.co/paige-ai/Virchow)
- [Virchow2](https://huggingface.co/paige-ai/Virchow2)
- [Phikon](https://huggingface.co/owkin/phikon)
- [Phikon-v2](https://huggingface.co/owkin/phikon-v2)
- [Kaiko](https://github.com/kaiko-ai/towards_large_pathology_fms)
- [Lunit](https://huggingface.co/1aurent/vit_small_patch8_224.lunit_dino)
- [Hibou](https://huggingface.co/histai/hibou-L)
- [CTransPath](https://github.com/Xiyue-Wang/TransPath)
- ResNet
See `licenses/references.txt` for model citations.
### Benchmark Results
**Note:** The provided script is a simplified example of training code. In practice, hyperparameter tuning and additional techniques were employed to achieve the following results.
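The result tables report both accuracy and balanced accuracy. Assuming balanced accuracy follows the usual definition (the mean of per-class recall, as in scikit-learn's `balanced_accuracy_score`), the difference between the two metrics can be sketched in plain Python. The function names and toy labels below are illustrative only.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / support[c] for c in support) / len(support)


# A classifier that always predicts the majority class.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 10
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

Because the class sizes in this dataset vary widely (compare Cholangiocarcinoma with Breast invasive carcinoma), balanced accuracy penalizes models that neglect the small classes.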
#### Internal Split Results
| Model | Accuracy (LogReg) | Balanced Accuracy (LogReg) | Accuracy (KNN) | Balanced Accuracy (KNN) | Accuracy (Prototype) | Balanced Accuracy (Prototype) |
|-----|-----------------|--------------------------|--------------|-----------------------|--------------------|-----------------------------|
| Kaiko(l14)* | 0.8608 | **0.8662** | 0.8116 | 0.7636 | 0.7708 | 0.7434 |
| H-Optimus-1 | **0.8616** | 0.8557 | **0.8164** | **0.7671** | **0.7730** | **0.7579** |
| UNI2 | 0.8564 | 0.8501 | 0.7962 | 0.7434 | 0.7546 | 0.7476 |
| H-Optimus-0 | 0.8498 | 0.8399 | 0.7930 | 0.7307 | 0.7492 | 0.7321 |
| Virchow2 | 0.8455 | 0.8351 | 0.7686 | 0.6989 | 0.6671 | 0.6500 |
| Phikon-v2 | 0.8289 | 0.8212 | 0.7467 | 0.6777 | 0.6982 | 0.6869 |
| Phikon | 0.8342 | 0.8111 | 0.7207 | 0.6255 | 0.6625 | 0.6385 |
| Virchow | 0.8223 | 0.8008 | 0.7244 | 0.6262 | 0.6087 | 0.5759 |
| Hibou | 0.8189 | 0.7985 | 0.7433 | 0.6618 | 0.6291 | 0.6034 |
| UNI | 0.8144 | 0.7923 | 0.7634 | 0.6897 | 0.7109 | 0.6946 |
| GigaPath | 0.8161 | 0.7878 | 0.7444 | 0.6676 | 0.6967 | 0.6675 |
| Lunit* | 0.7919 | 0.7535 | 0.7427 | 0.6539 | 0.6611 | 0.6427 |
| CONCH | 0.7672 | 0.7295 | 0.7028 | 0.6139 | 0.6150 | 0.6097 |
| CTransPath | 0.7255 | 0.6748 | 0.6200 | 0.5057 | 0.5158 | 0.4857 |
| ResNet | 0.6395 | 0.5581 | 0.5114 | 0.3816 | 0.3154 | 0.2973 |
\* The model's training data includes the TCGA dataset.
#### External Split Results
| Model | Accuracy (LogReg) | Balanced Accuracy (LogReg) | Accuracy (KNN) | Balanced Accuracy (KNN) | Accuracy (Prototype) | Balanced Accuracy (Prototype) |
|-----|-----------------|--------------------------|--------------|-----------------------|--------------------|-----------------------------|
| H-Optimus-1 | **0.8080** | **0.7450** | **0.7700** | **0.6955** | **0.7572** | **0.7363** |
| Kaiko(b8)* | 0.7920 | 0.7370 | 0.7181 | 0.6597 | 0.7509 | 0.7134 |
| UNI2 | 0.7648 | 0.7262 | 0.7210 | 0.6498 | 0.7018 | 0.6839 |
| H-Optimus-0 | 0.7845 | 0.7213 | 0.7209 | 0.6579 | 0.7106 | 0.6842 |
| Virchow2 | 0.7744 | 0.6919 | 0.7221 | 0.6544 | 0.6482 | 0.6331 |
| UNI | 0.7373 | 0.6581 | 0.6668 | 0.5887 | 0.6612 | 0.6232 |
| Phikon-v2 | 0.7185 | 0.6535 | 0.5857 | 0.5040 | 0.6197 | 0.5752 |
| Virchow | 0.7274 | 0.6490 | 0.6464 | 0.5541 | 0.5847 | 0.5636 |
| GigaPath | 0.7246 | 0.6379 | 0.6426 | 0.5495 | 0.6361 | 0.5960 |
| Phikon | 0.7311 | 0.6351 | 0.5511 | 0.4586 | 0.5474 | 0.5104 |
| Hibou | 0.6696 | 0.6161 | 0.5155 | 0.4436 | 0.4911 | 0.4765 |
| Lunit* | 0.6851 | 0.6044 | 0.6021 | 0.5098 | 0.5862 | 0.5503 |
| CONCH | 0.6991 | 0.5975 | 0.6626 | 0.5735 | 0.5954 | 0.5905 |
| CTransPath | 0.6160 | 0.5215 | 0.5229 | 0.4205 | 0.4498 | 0.4128 |
| ResNet | 0.4967 | 0.3929 | 0.3960 | 0.2871 | 0.2657 | 0.2392 |
\* The model's training data includes the TCGA dataset.
### Getting Started
1. Clone this repository:
```bash
git clone [repository-url]
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Log in to Hugging Face:
- The first time you run the program, you must log in with a Hugging Face account that has access to the dataset and the model you wish to use.
4. (Optional) Setup:
- A notebook file `setup.ipynb` is provided for repository cloning, environment setup, and code execution. It has been confirmed to work in the Google Colaboratory environment.
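One possible way to handle the login step non-interactively is to read an access token from an environment variable and pass it to `huggingface_hub.login`. This is a sketch, not the repository's own procedure: the helper `login_from_env` and the variable name `HF_TOKEN` are assumptions. Running `huggingface-cli login` in a terminal is an interactive alternative.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in with a token from the environment; return False if none is set."""
    token = os.environ.get(var_name)
    if not token:
        return False
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import login

    login(token=token)
    return True
```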
### Troubleshooting
#### Dependencies Installation
While `requirements.txt` specifies version numbers for dependencies, some installations might require additional steps or alternative approaches depending on your system configuration:
1. **SPAMS Library Installation**
- If the standard SPAMS installation fails, try:
```bash
pip install spams-bin
```
- On some systems, you might also need to install these additional Python packages:
```bash
pip install PyOpenGL PyOpenGL_accelerate
```
2. **Version Compatibility**
- While we specify exact versions in `requirements.txt`, some dependencies might require different versions based on your hardware configuration
- If you encounter compatibility issues, try installing without version constraints and test functionality
#### Dataset Label Data Type Issues
When the dataset is created, an error may occur because of the label's data type. If you encounter this, modify line 83 of `extract_train.py` as follows:
From:
```python
label = torch.tensor(self.labels[idx], dtype=torch.long)
```
To:
```python
label = torch.tensor(int(self.labels[idx]), dtype=torch.long)
```
### Data Loading Example
The dataset uses WebDataset format for efficient loading. Here's an example from `extract_train.py`:
```python
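import os
import webdataset as wds

# Excerpt: work_dir, split, mode, file_range, buffer_size, func_transform,
# encode_labels, and label_encoder are defined elsewhere in extract_train.py.
patterns = {
    'train': [os.path.join(work_dir, f"data/dataset_{split}_train_part{str(i).zfill(3)}.tar") for i in range(39)],
    'valid': [os.path.join(work_dir, f"data/dataset_{split}_valid_part{str(i).zfill(3)}.tar") for i in range(file_range)],
    'test': [os.path.join(work_dir, f"data/dataset_{split}_test_part{str(i).zfill(3)}.tar") for i in range(file_range)],
}
# Read shards in order, shuffle samples in a buffer, decode to (image, metadata),
# then apply the image transform and encode the label.
dataset = wds.WebDataset(patterns[mode], shardshuffle=False) \
    .shuffle(buffer_size, seed=42) \
    .decode("pil").to_tuple("jpg", "json") \
    .map_tuple(func_transform, lambda x: encode_labels([x["label"]], label_encoder))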
```
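The excerpt hard-codes the number of shards per split (`range(39)` and `file_range`). As an alternative sketch, the shard list could be discovered with a glob that mirrors the `data_files` patterns in the dataset card. `find_shards` is a hypothetical helper and is not part of `extract_train.py`.

```python
import glob
import os


def find_shards(work_dir, split, mode):
    """Return sorted shard paths, e.g. for split='internal' and mode='train'."""
    pattern = os.path.join(work_dir, "data", f"dataset_{split}_{mode}_part*.tar")
    shards = sorted(glob.glob(pattern))
    if not shards:
        raise FileNotFoundError(f"no shards match {pattern}")
    return shards
```

The returned list can be passed to `wds.WebDataset(...)` in place of `patterns[mode]`, since `WebDataset` accepts a list of shard paths.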
### Configuration and Usage
1. Configure your experiment in `config.yaml`:
```yaml
model_name: "h_optimus" # Model selection: "h_optimus", etc.
split_type: "internal" # Split type: "internal" or "external"
device: "cuda" # Computation device: "cuda" or "cpu"
eval_name: "logreg" # Evaluation method: "logreg", "knn", or "proto"
feature_exist: True # Skip feature extraction if features already exist
max_iter: 1000 # Maximum iterations for training
cost: 0.0001 # Cost parameter for logistic regression
```
Configuration parameters:
- `model_name`: Foundation model to use for feature extraction
- `split_type`: Dataset split strategy
- `eval_name`: Evaluation method (`logreg`, `knn`, or `proto`)
- `device`: Computation device (GPU/CPU)
- `feature_exist`: Skip feature extraction if True and features are already available
- `max_iter`: Maximum training iterations for logistic regression
- `cost`: Regularization parameter for logistic regression
- `k`: Number of nearest neighbors for KNN (not set in the example above)
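Before launching a long feature-extraction run, it can help to validate the configuration values. The sketch below checks a config dict against the options listed in the example `config.yaml`. The `check_config` helper is hypothetical, and the allowed values are taken only from the comments in that example, so the real script may accept more.

```python
# Allowed values as listed in the example config.yaml comments.
ALLOWED = {
    "split_type": {"internal", "external"},
    "device": {"cuda", "cpu"},
    "eval_name": {"logreg", "knn", "proto"},
}


def check_config(cfg):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    for key, allowed in ALLOWED.items():
        if cfg.get(key) not in allowed:
            problems.append(f"{key}={cfg.get(key)!r} not in {sorted(allowed)}")
    max_iter = cfg.get("max_iter")
    if not isinstance(max_iter, int) or max_iter <= 0:
        problems.append(f"max_iter={max_iter!r} should be a positive integer")
    return problems


example = {
    "model_name": "h_optimus", "split_type": "internal", "device": "cuda",
    "eval_name": "logreg", "feature_exist": True, "max_iter": 1000, "cost": 0.0001,
}
print(check_config(example))
```

With PyYAML installed, the dict could be obtained by calling `yaml.safe_load` on the opened `config.yaml` file.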
2. Define models and transforms in `extract_train.py`:
```python
def get_model_transform(model_name):
    # Add your model and transform definitions here
    pass
```
3. Run the experiment:
```bash
python extract_train.py
```
This will:
- Extract features using the specified foundation model
- Save features to H5 files
- Perform linear probing, KNN, and prototype classification
- Output accuracy and balanced accuracy metrics
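For readers unfamiliar with prototype classification, here is a minimal sketch. It assumes that "prototype" means a nearest class-mean classifier: each class is represented by the average of its training feature vectors, and a test sample gets the label of the closest prototype. The repository's implementation may differ, for example by using cosine similarity or feature normalization, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_prototypes(features, labels):
    """Average the feature vectors of each class into one prototype per class."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(prototypes, vec):
    """Return the class whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda c: math.dist(prototypes[c], vec))


# Toy 2-D "features" standing in for foundation-model embeddings.
protos = fit_prototypes([[0, 0], [0, 2], [10, 10], [12, 10]], ["a", "a", "b", "b"])
print(predict(protos, [1, 1]), predict(protos, [9, 9]))
```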
## License
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA 4.0).
- For non-commercial use: Please use the dataset under CC-BY-NC-SA
- For commercial use: Please contact us at ishum-prm@m.u-tokyo.ac.jp
## Citation
If you use this dataset, please cite the original paper:
```bibtex
@article{komura2022universal,
title={Universal encoding of pan-cancer histology by deep texture representations},
author={Komura, D. and Kawabe, A. and Fukuta, K. and Sano, K. and Umezaki, T. and Koda, H. and Suzuki, R. and Tominaga, K. and Ochi, M. and Konishi, H. and Masakado, F. and Saito, N. and Sato, Y. and Onoyama, T. and Nishida, S. and Furuya, G. and Katoh, H. and Yamashita, H. and Kakimi, K. and Seto, Y. and Ushiku, T. and Fukayama, M. and Ishikawa, S.},
journal={Cell Reports},
volume={38},
pages={110424},
year={2022},
doi={10.1016/j.celrep.2022.110424}
}
```
Provider:
vaaaasss



