Name: vaaaasss/tcga-ut
Creator: vaaaasss
Published: 2026-03-04 16:02:09
License: No description provided

Download link:

https://hf-mirror.com/datasets/vaaaasss/tcga-ut

Resource overview:

---
license: cc-by-nc-sa-4.0
dataset_info:
  features:
  - name: __key__
    dtype: string
  - name: __url__
    dtype: string
  - name: jpg
    dtype: image
  - name: json
    struct:
    - name: label
      dtype: string
  splits:
  - name: internal_train
    num_bytes: 4231024640
    num_examples: 190080
  - name: internal_valid
    num_bytes: 910479360
    num_examples: 40770
  - name: internal_test
    num_bytes: 911656960
    num_examples: 40860
  - name: external_train
    num_bytes: 4291031040
    num_examples: 192680
  - name: external_valid
    num_bytes: 893061120
    num_examples: 39670
  - name: external_test
    num_bytes: 869038080
    num_examples: 39360
  download_size: 12106291200
configs:
- config_name: internal
  data_files:
  - split: train
    path: data/dataset_internal_train_part*.tar
  - split: valid
    path: data/dataset_internal_valid_part*.tar
  - split: test
    path: data/dataset_internal_test_part*.tar
  default: true
- config_name: external
  data_files:
  - split: train
    path: data/dataset_external_train_part*.tar
  - split: valid
    path: data/dataset_external_valid_part*.tar
  - split: test
    path: data/dataset_external_test_part*.tar
tags:
- histology
- pathology
- webdataset
- image
task_categories:
- image-feature-extraction
- image-classification
---

# Histology images from uniform tumor regions in TCGA Whole Slide Images (TCGA-UT-Internal, TCGA-UT-External)

<div style="text-align: center;">
<img src="logo.webp" width="600" alt="TCGA Histology Dataset Logo">
</div>

This repository provides a benchmarking framework for the TCGA histology image dataset originally published on [Zenodo](https://zenodo.org/records/5889558). It includes predefined train/validation/test splits and example code for foundation model evaluation.

## Task

Classification of 31 different cancer types from tumor histopathological images.

## Original Dataset Description

This dataset contains 1,608,060 image patches of hematoxylin & eosin stained histological samples from various human cancers.
The data was collected and processed as follows:

- Source: TCGA dataset from 32 solid cancer types (GDC legacy database, downloaded between December 1, 2016, and June 19, 2017)
- Initial data: 9,662 diagnostic slides from 7,951 patients in SVS format
- Annotation: At least three representative tumor regions were selected as polygons by two trained pathologists
- Quality control: 926 slides were removed due to poor staining, low resolution, out-of-focus issues, absence of cancerous regions, or incorrect cancer types
- Final dataset: 8,736 diagnostic slides from 7,175 patients
- Patch extraction: 10 patches at 0.5 μm/pixel resolution (128 x 128 μm) were randomly cropped from each annotated region

Note: Additional resolution levels are available in the original Zenodo dataset. Please refer to the Zenodo repository for the complete dataset.

The TCGA barcode prefix (TCGA-XX-XXXX) identifies the patient. For details, see the [TCGA Barcode documentation](https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/).

## Updates in This Version

The dataset has been modified and organized for benchmarking purposes:

1. **Label Consolidation**:
   - Colon Adenocarcinoma (COAD) and Rectum Adenocarcinoma (READ) have been merged due to their histological similarity
2. **Structured Splits**:

### Internal Split (70:15:15): TCGA-UT-Internal

- Ensures no patient overlap between train, validation, and test sets
- Approximate distribution: 70% train, 15% validation, 15% test

### External Split: TCGA-UT-External

- Separates data based on medical facilities to evaluate cross-institutional generalization
- No facility overlap between train, validation, and test sets
- Maintains similar class distributions across splits

## Dataset Details

### Internal Split: TCGA-UT-Internal

| case | train (patches) | valid (patches) | test (patches) | train (patients) | valid (patients) | test (patients) |
|:-----|----------------:|----------------:|---------------:|-----------------:|-----------------:|----------------:|
| Adrenocortical_carcinoma | 3480 | 750 | 750 | 35 | 8 | 8 |
| Bladder_Urothelial_Carcinoma | 6990 | 1500 | 1500 | 202 | 43 | 44 |
| Brain_Lower_Grade_Glioma | 16480 | 3530 | 3520 | 326 | 70 | 71 |
| Breast_invasive_carcinoma | 16580 | 3550 | 3560 | 513 | 110 | 111 |
| Cervical_squamous_cell_carcinoma_and_endocervical_adenocarcinoma | 4380 | 930 | 960 | 140 | 30 | 31 |
| Cholangiocarcinoma | 630 | 120 | 150 | 21 | 4 | 5 |
| Colon_Rectum_adenocarcinoma | 7020 | 1510 | 1500 | 190 | 41 | 41 |
| Esophageal_carcinoma | 2360 | 510 | 510 | 78 | 17 | 17 |
| Glioblastoma_multiforme | 16620 | 3570 | 3550 | 254 | 54 | 55 |
| Head_and_Neck_squamous_cell_carcinoma | 8250 | 1770 | 1770 | 221 | 48 | 48 |
| Kidney_Chromophobe | 1710 | 360 | 390 | 57 | 12 | 13 |
| Kidney_renal_clear_cell_carcinoma | 8160 | 1740 | 1750 | 269 | 58 | 58 |
| Kidney_renal_papillary_cell_carcinoma | 4750 | 1020 | 1020 | 149 | 32 | 33 |
| Liver_hepatocellular_carcinoma | 5860 | 1250 | 1260 | 190 | 41 | 41 |
| Lung_adenocarcinoma | 11520 | 2470 | 2470 | 303 | 65 | 66 |
| Lung_squamous_cell_carcinoma | 11590 | 2490 | 2480 | 305 | 66 | 66 |
| Lymphoid_Neoplasm_Diffuse_Large_B-cell_Lymphoma | 570 | 120 | 150 | 19 | 4 | 5 |
| Mesothelioma | 1470 | 320 | 300 | 42 | 9 | 10 |
| Ovarian_serous_cystadenocarcinoma | 1740 | 390 | 390 | 58 | 13 | 13 |
| Pancreatic_adenocarcinoma | 2850 | 620 | 620 | 88 | 19 | 19 |
| Pheochromocytoma_and_Paraganglioma | 930 | 210 | 210 | 30 | 7 | 7 |
| Prostate_adenocarcinoma | 6870 | 1470 | 1470 | 212 | 45 | 46 |
| Sarcoma | 9440 | 2010 | 2030 | 149 | 32 | 32 |
| Skin_Cutaneous_Melanoma | 7040 | 1510 | 1510 | 226 | 48 | 49 |
| Stomach_adenocarcinoma | 6770 | 1450 | 1450 | 182 | 39 | 39 |
| Testicular_Germ_Cell_Tumors | 4210 | 900 | 900 | 92 | 20 | 20 |
| Thymoma | 2520 | 540 | 540 | 59 | 13 | 13 |
| Thyroid_carcinoma | 7950 | 1710 | 1700 | 259 | 56 | 56 |
| Uterine_Carcinosarcoma | 1470 | 320 | 330 | 34 | 7 | 8 |
| Uterine_Corpus_Endometrial_Carcinoma | 8730 | 1890 | 1860 | 266 | 57 | 58 |
| Uveal_Melanoma | 1140 | 240 | 260 | 38 | 8 | 9 |
| **Total** | 190080 | 40770 | 40860 | 5007 | 1076 | 1092 |

### External Split: TCGA-UT-External

| case | train (patches) | valid (patches) | test (patches) | train (patients) | valid (patients) | test (patients) |
|:-----|----------------:|----------------:|---------------:|-----------------:|-----------------:|----------------:|
| Adrenocortical_carcinoma | 4500 | 390 | 90 | 45 | 5 | 1 |
| Bladder_Urothelial_Carcinoma | 6990 | 1500 | 1500 | 190 | 50 | 49 |
| Brain_Lower_Grade_Glioma | 16430 | 3540 | 3560 | 332 | 80 | 55 |
| Breast_invasive_carcinoma | 16560 | 3570 | 3560 | 509 | 116 | 109 |
| Cervical_squamous_cell_carcinoma_and_endocervical_adenocarcinoma | 4380 | 930 | 960 | 145 | 31 | 25 |
| Cholangiocarcinoma | 660 | 150 | 90 | 22 | 5 | 3 |
| Colon_Rectum_adenocarcinoma | 7020 | 1500 | 1510 | 197 | 39 | 36 |
| Esophageal_carcinoma | 2360 | 510 | 510 | 78 | 17 | 17 |
| Glioblastoma_multiforme | 16630 | 3810 | 3300 | 244 | 76 | 43 |
| Head_and_Neck_squamous_cell_carcinoma | 8260 | 1750 | 1780 | 224 | 51 | 42 |
| Kidney_Chromophobe | 1740 | 270 | 450 | 58 | 9 | 15 |
| Kidney_renal_clear_cell_carcinoma | 8170 | 1710 | 1770 | 269 | 57 | 59 |
| Kidney_renal_papillary_cell_carcinoma | 4750 | 1020 | 1020 | 146 | 34 | 34 |
| Liver_hepatocellular_carcinoma | 5870 | 1300 | 1200 | 189 | 43 | 40 |
| Lung_adenocarcinoma | 11530 | 2470 | 2460 | 288 | 77 | 69 |
| Lung_squamous_cell_carcinoma | 11580 | 2490 | 2490 | 296 | 68 | 73 |
| Lymphoid_Neoplasm_Diffuse_Large_B-cell_Lymphoma | 600 | 90 | 150 | 20 | 3 | 5 |
| Mesothelioma | 1470 | 300 | 320 | 43 | 10 | 8 |
| Ovarian_serous_cystadenocarcinoma | 2220 | 120 | 180 | 74 | 4 | 6 |
| Pancreatic_adenocarcinoma | 2860 | 600 | 630 | 85 | 20 | 21 |
| Pheochromocytoma_and_Paraganglioma | 1170 | 90 | 90 | 38 | 3 | 3 |
| Prostate_adenocarcinoma | 6870 | 1470 | 1470 | 226 | 49 | 28 |
| Sarcoma | 9490 | 2070 | 1920 | 154 | 28 | 31 |
| Skin_Cutaneous_Melanoma | 7030 | 1530 | 1500 | 233 | 40 | 50 |
| Stomach_adenocarcinoma | 6990 | 1330 | 1350 | 187 | 37 | 36 |
| Testicular_Germ_Cell_Tumors | 4600 | 630 | 780 | 96 | 10 | 26 |
| Thymoma | 2520 | 540 | 540 | 54 | 18 | 13 |
| Thyroid_carcinoma | 7980 | 1650 | 1730 | 259 | 54 | 58 |
| Uterine_Carcinosarcoma | 1470 | 330 | 320 | 37 | 7 | 5 |
| Uterine_Corpus_Endometrial_Carcinoma | 8730 | 1890 | 1860 | 272 | 48 | 61 |
| Uveal_Melanoma | 1250 | 120 | 270 | 42 | 4 | 9 |
| **Total** | 192680 | 39670 | 39360 | 5052 | 1093 | 1030 |

## Foundation Model Benchmarking

We provide example implementations using the following foundation models:

- [CONCH](https://huggingface.co/MahmoodLab/CONCH)
- [GigaPath](https://huggingface.co/prov-gigapath/prov-gigapath)
- [UNI](https://huggingface.co/MahmoodLab/UNI)
- [UNI2](https://huggingface.co/MahmoodLab/UNI2-h)
- [H-Optimus-0](https://huggingface.co/bioptimus/H-optimus-0)
- [H-Optimus-1](https://huggingface.co/bioptimus/H-optimus-1)
- [Virchow](https://huggingface.co/paige-ai/Virchow)
- [Virchow2](https://huggingface.co/paige-ai/Virchow2)
- [Phikon](https://huggingface.co/owkin/phikon)
- [Phikon-v2](https://huggingface.co/owkin/phikon-v2)
- [Kaiko](https://github.com/kaiko-ai/towards_large_pathology_fms)
- [Lunit](https://huggingface.co/1aurent/vit_small_patch8_224.lunit_dino)
- [Hibou](https://huggingface.co/histai/hibou-L)
- [CTransPath](https://github.com/Xiyue-Wang/TransPath)
- ResNet

See `licenses/references.txt` for model citations.

### Benchmark Results

**Note:** The provided script is a simplified example of training code. In practice, hyperparameter tuning and additional techniques were employed to achieve the following results.

#### Internal Split Results

| Model | Accuracy (LogReg) | Balanced Accuracy (LogReg) | Accuracy (KNN) | Balanced Accuracy (KNN) | Accuracy (Prototype) | Balanced Accuracy (Prototype) |
|-----|-----------------|--------------------------|--------------|-----------------------|--------------------|-----------------------------|
| Kaiko(l14)* | 0.8608 | **0.8662** | 0.8116 | 0.7636 | 0.7708 | 0.7434 |
| H-Optimus-1 | **0.8616** | 0.8557 | **0.8164** | **0.7671** | **0.7730** | **0.7579** |
| UNI2 | 0.8564 | 0.8501 | 0.7962 | 0.7434 | 0.7546 | 0.7476 |
| H-Optimus-0 | 0.8498 | 0.8399 | 0.7930 | 0.7307 | 0.7492 | 0.7321 |
| Virchow2 | 0.8455 | 0.8351 | 0.7686 | 0.6989 | 0.6671 | 0.6500 |
| Phikon-v2 | 0.8289 | 0.8212 | 0.7467 | 0.6777 | 0.6982 | 0.6869 |
| Phikon | 0.8342 | 0.8111 | 0.7207 | 0.6255 | 0.6625 | 0.6385 |
| Virchow | 0.8223 | 0.8008 | 0.7244 | 0.6262 | 0.6087 | 0.5759 |
| Hibou | 0.8189 | 0.7985 | 0.7433 | 0.6618 | 0.6291 | 0.6034 |
| UNI | 0.8144 | 0.7923 | 0.7634 | 0.6897 | 0.7109 | 0.6946 |
| GigaPath | 0.8161 | 0.7878 | 0.7444 | 0.6676 | 0.6967 | 0.6675 |
| Lunit* | 0.7919 | 0.7535 | 0.7427 | 0.6539 | 0.6611 | 0.6427 |
| CONCH | 0.7672 | 0.7295 | 0.7028 | 0.6139 | 0.6150 | 0.6097 |
| CTransPath | 0.7255 | 0.6748 | 0.6200 | 0.5057 | 0.5158 | 0.4857 |
| ResNet | 0.6395 | 0.5581 | 0.5114 | 0.3816 | 0.3154 | 0.2973 |

\* The model's training data includes the TCGA dataset.
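
The result tables report accuracy and balanced accuracy side by side. The sketch below illustrates the difference, assuming the usual definition of balanced accuracy as the mean of per-class recall; the benchmark script's exact implementation is not shown here, and the toy labels are invented for illustration only.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy example (hypothetical labels): a classifier that always predicts the
# majority class on an imbalanced two-class set.
y_true = ["Sarcoma"] * 8 + ["Thymoma"] * 2
y_pred = ["Sarcoma"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Because class sizes in this dataset differ widely (for example, Cholangiocarcinoma versus Glioblastoma_multiforme), a gap between the two metrics in a table row suggests that a model performs worse on the rarer classes.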
#### External Split Results

| Model | Accuracy (LogReg) | Balanced Accuracy (LogReg) | Accuracy (KNN) | Balanced Accuracy (KNN) | Accuracy (Prototype) | Balanced Accuracy (Prototype) |
|-----|-----------------|--------------------------|--------------|-----------------------|--------------------|-----------------------------|
| H-Optimus-1 | **0.8080** | **0.7450** | **0.7700** | **0.6955** | **0.7572** | **0.7363** |
| Kaiko(b8)* | 0.7920 | 0.7370 | 0.7181 | 0.6597 | 0.7509 | 0.7134 |
| UNI2 | 0.7648 | 0.7262 | 0.7210 | 0.6498 | 0.7018 | 0.6839 |
| H-Optimus-0 | 0.7845 | 0.7213 | 0.7209 | 0.6579 | 0.7106 | 0.6842 |
| Virchow2 | 0.7744 | 0.6919 | 0.7221 | 0.6544 | 0.6482 | 0.6331 |
| UNI | 0.7373 | 0.6581 | 0.6668 | 0.5887 | 0.6612 | 0.6232 |
| Phikon-v2 | 0.7185 | 0.6535 | 0.5857 | 0.5040 | 0.6197 | 0.5752 |
| Virchow | 0.7274 | 0.6490 | 0.6464 | 0.5541 | 0.5847 | 0.5636 |
| GigaPath | 0.7246 | 0.6379 | 0.6426 | 0.5495 | 0.6361 | 0.5960 |
| Phikon | 0.7311 | 0.6351 | 0.5511 | 0.4586 | 0.5474 | 0.5104 |
| Hibou | 0.6696 | 0.6161 | 0.5155 | 0.4436 | 0.4911 | 0.4765 |
| Lunit* | 0.6851 | 0.6044 | 0.6021 | 0.5098 | 0.5862 | 0.5503 |
| CONCH | 0.6991 | 0.5975 | 0.6626 | 0.5735 | 0.5954 | 0.5905 |
| CTransPath | 0.6160 | 0.5215 | 0.5229 | 0.4205 | 0.4498 | 0.4128 |
| ResNet | 0.4967 | 0.3929 | 0.3960 | 0.2871 | 0.2657 | 0.2392 |

\* The model's training data includes the TCGA dataset.

### Getting Started

1. Clone this repository:
   ```bash
   git clone [repository-url]
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Log in to Hugging Face:
   - The first time you run the program, you must log in with a Hugging Face account that has access to the dataset and the model you wish to use.
4. (Optional) Setup:
   - A notebook file `setup.ipynb` is provided for repository cloning, environment setup, and code execution. It has been confirmed to work in the Google Colaboratory environment.
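
As an alternative to working from a local clone, the splits can probably also be streamed with the Hugging Face `datasets` library. The following is a minimal sketch, not the repository's own loader: it assumes the repository ID shown on this page (`vaaaasss/tcga-ut`), the config and split names declared in the YAML header, and that the account in use has access to the dataset. The helper names `check_selection`, `stream_split`, and `peek_label` are hypothetical.

```python
# Config and split names as declared in the YAML header of this dataset card.
CONFIGS = ("internal", "external")
SPLITS = ("train", "valid", "test")
REPO_ID = "vaaaasss/tcga-ut"  # assumption: the repository ID shown on this page


def check_selection(config: str, split: str) -> tuple:
    """Validate a (config, split) pair against the names declared in the card."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    return config, split


def stream_split(config: str = "internal", split: str = "valid"):
    """Return a streaming dataset for one split.

    Requires the `datasets` package and network access. If the dataset needs
    authentication, log in first (for example with `huggingface-cli login`).
    """
    # Imported lazily so that check_selection works without `datasets` installed.
    from datasets import load_dataset

    config, split = check_selection(config, split)
    return load_dataset(REPO_ID, name=config, split=split, streaming=True)


def peek_label(config: str = "internal", split: str = "valid") -> str:
    """Fetch the first sample and return its cancer-type label."""
    sample = next(iter(stream_split(config, split)))
    # Per the card's feature list, each sample has "jpg" (image) and "json" (label struct).
    return sample["json"]["label"]
```

Calling `peek_label("external", "test")` would download only as much data as is needed for the first sample; streaming avoids fetching the full archive (about 12 GB according to `download_size`) up front.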
### Troubleshooting

#### Dependencies Installation

While `requirements.txt` specifies version numbers for dependencies, some installations might require additional steps or alternative approaches depending on your system configuration:

1. **SPAMS Library Installation**
   - If the standard SPAMS installation fails, try:
     ```bash
     pip install spams-bin
     ```
   - On some systems, you might need to install additional packages:
     ```bash
     pip install PyOpenGL PyOpenGL_accelerate
     ```
2. **Version Compatibility**
   - While we specify exact versions in `requirements.txt`, some dependencies might require different versions based on your hardware configuration
   - If you encounter compatibility issues, try installing without version constraints and test functionality

#### Dataset Label Data Type Issues

Creating the dataset may fail with an error caused by the data type of the label. If you encounter such an issue, try modifying line 83 in `extract_train.py` as follows:

From:
```python
label = torch.tensor(self.labels[idx], dtype=torch.long)
```

To:
```python
label = torch.tensor(int(self.labels[idx]), dtype=torch.long)
```

### Data Loading Example

The dataset uses WebDataset format for efficient loading. Here's an example from `extract_train.py`:

```python
import os
import webdataset as wds

# work_dir, split, mode, file_range, buffer_size, func_transform, encode_labels
# and label_encoder are defined elsewhere in extract_train.py.

# Shard paths for each subset; shard indices are zero-padded to three digits.
patterns = {
    'train': [os.path.join(work_dir, f"data/dataset_{split}_train_part{str(i).zfill(3)}.tar") for i in range(39)],
    'valid': [os.path.join(work_dir, f"data/dataset_{split}_valid_part{str(i).zfill(3)}.tar") for i in range(file_range)],
    'test': [os.path.join(work_dir, f"data/dataset_{split}_test_part{str(i).zfill(3)}.tar") for i in range(file_range)],
}

# Decode each sample to (PIL image, JSON dict), then apply the image transform
# and encode the string label.
dataset = wds.WebDataset(patterns[mode], shardshuffle=False) \
    .shuffle(buffer_size, seed=42) \
    .decode("pil").to_tuple("jpg", "json") \
    .map_tuple(func_transform, lambda x: encode_labels([x["label"]], label_encoder))
```

### Configuration and Usage

1. Configure your experiment in `config.yaml`:
   ```yaml
   model_name: "h_optimus"  # Model selection: "h_optimus", etc.
   split_type: "internal"   # Split type: "internal" or "external"
   device: "cuda"           # Computation device: "cuda" or "cpu"
   eval_name: "logreg"      # Evaluation method: "logreg", "knn", or "proto"
   feature_exist: True      # Skip feature extraction if features already exist
   max_iter: 1000           # Maximum iterations for training
   cost: 0.0001             # Cost parameter for logistic regression
   ```

   Configuration parameters:
   - `model_name`: Foundation model to use for feature extraction
   - `split_type`: Dataset split strategy
   - `eval_name`: Evaluation method (logreg, knn, proto)
   - `device`: Computation device (GPU/CPU)
   - `feature_exist`: Skip feature extraction if True and features are already available
   - `max_iter`: Maximum training iterations for logistic regression
   - `cost`: Regularization parameter for logistic regression
   - `k`: Number of nearest neighbors in KNN

2. Define models and transforms in `extract_train.py`:
   ```python
   def get_model_transform(model_name):
       # Add your model and transform definitions here
       pass
   ```

3. Run the experiment:
   ```bash
   python extract_train.py
   ```

   This will:
   - Extract features using the specified foundation model
   - Save features to H5 files
   - Perform linear probing, KNN, and prototype classification
   - Output accuracy and balanced accuracy metrics

## License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA 4.0).

- For non-commercial use: Please use the dataset under CC-BY-NC-SA
- For commercial use: Please contact us at ishum-prm@m.u-tokyo.ac.jp

## Citation

If you use this dataset, please cite the original paper:

```bibtex
@article{komura2022universal,
  title={Universal encoding of pan-cancer histology by deep texture representations},
  author={Komura, D. and Kawabe, A. and Fukuta, K. and Sano, K. and Umezaki, T. and Koda, H. and Suzuki, R. and Tominaga, K. and Ochi, M. and Konishi, H. and Masakado, F. and Saito, N. and Sato, Y. and Onoyama, T. and Nishida, S. and Furuya, G. and Katoh, H. and Yamashita, H. and Kakimi, K. and Seto, Y. and Ushiku, T. and Fukayama, M. and Ishikawa, S.},
  journal={Cell Reports},
  volume={38},
  pages={110424},
  year={2022},
  doi={10.1016/j.celrep.2022.110424}
}
```
