ogutsevda/graph-tcga-brca
Source: Hugging Face (updated 2026-03-03, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/ogutsevda/graph-tcga-brca
---
license: cc-by-nc-sa-4.0
task_categories:
- graph-ml
tags:
- histopathology
- graph-classification
- breast-cancer
- pytorch-geometric
pretty_name: Graph-TCGA-BRCA
size_categories:
- 1M<n<10M
---
# Graph-TCGA-BRCA: A Cell-Graph Dataset for Breast Cancer from TCGA-BRCA
<p align="center">
<img src="preprocessing.png" width="600"/>
</p>
**Graph-TCGA-BRCA** is a graph-level classification dataset derived from the [TCGA-BRCA](https://portal.gdc.cancer.gov/projects/TCGA-BRCA) histopathology dataset. Each 224×224 patch image is converted into a **cell-graph** in which nodes represent detected cell nuclei and edges encode spatial proximity, enabling graph-based learning for fine-grained breast lesion subtyping across 2 clinically relevant classes. Node features describe cell morphology, texture, and color intensity; edge features are Euclidean distances in micrometers.
This dataset accompanies the paper [GrapHist: Graph Self-Supervised Learning for Histopathology](https://arxiv.org/pdf/2603.00143).
> ⚠️ **Edge Weight Note**: The GrapHist architecture supports both positive and negative edge weights, but by default the edge features are Euclidean distances, so nodes that are farther apart have larger positive values. This can be counterintuitive for many graph neural network models, which typically treat a larger weight as a stronger connection. We recommend experimenting with transformed edge weights, such as the inverse (e.g., `1/distance`) or the negative distance (e.g., `-distance`), so that larger values correspond to closer nuclei.
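
The two transformations mentioned in the note above can be sketched as follows. This is a minimal illustration, assuming `edge_attr` is a `[num_edges, 1]` tensor of positive distances as described in this card. The helper name `transform_edge_weights`, its `mode` argument, and the `eps` guard are hypothetical and not part of the GrapHist code.

```python
import torch


def transform_edge_weights(edge_attr: torch.Tensor, mode: str = "inverse",
                           eps: float = 1e-8) -> torch.Tensor:
    """Turn Euclidean distances into proximity-style edge weights.

    Hypothetical helper: the name, the modes, and the eps guard are illustrative.
    """
    if mode == "inverse":
        # Closer nodes get larger weights; eps guards against division by zero.
        return 1.0 / (edge_attr + eps)
    if mode == "negative":
        # Closer nodes get larger (less negative) weights.
        return -edge_attr
    raise ValueError(f"unknown mode: {mode!r}")


# Toy distances in micrometers for three edges, shape [num_edges, 1].
distances = torch.tensor([[2.0], [4.0], [8.0]])
inverse_weights = transform_edge_weights(distances, mode="inverse")
negative_weights = transform_edge_weights(distances, mode="negative")
```

On a loaded graph, the same call would presumably be applied as `graph.edge_attr = transform_edge_weights(graph.edge_attr)` before batching. Which variant works better is an empirical question.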
## Dataset Summary
| Property | Value |
|---|---|
| **Total graphs** | 11 149 500 |
| **Classes** | 2 |
| **Node feature dim** | 96 |
| **Edge feature dim** | 1 |
## Classes
| Label | Full Name | Count |
|---|---|---|
| `IDC` | Infiltrating Ductal Carcinoma| 794 |
| `LC` | Lobular Carcinoma | 204 |
## Data Structure
```
graph-tcga-brca/
├── README.md
├── metadata.csv # graph_path, sample_id, wsi_x, wsi_y, label, split
├── normalization.json # normalizer values for node and edge features computed from train patches
├── preprocessing.png
└── data/
├── graph-data-000000.tar
├── graph-data-000001.tar
└── ...
```
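The `metadata.csv` columns listed in the tree above can be used to select the graphs of one split before loading any `.pt` file. The sketch below uses only the standard library and a toy in-memory CSV. The helper `rows_for_split`, the split names `train` and `test`, and the toy paths and labels are assumptions for illustration; check the real file for the actual values.

```python
import csv
import io
from collections import Counter


def rows_for_split(csv_file, split):
    """Return metadata rows (as dicts) whose `split` column equals `split`."""
    return [row for row in csv.DictReader(csv_file) if row["split"] == split]


# Toy stand-in for metadata.csv; real paths, labels, and split names may differ.
toy_csv = io.StringIO(
    "graph_path,sample_id,wsi_x,wsi_y,label,split\n"
    "data/a_x0_y0.pt,a,0,0,IDC,train\n"
    "data/a_x224_y0.pt,a,224,0,IDC,train\n"
    "data/b_x0_y0.pt,b,0,0,LC,test\n"
)
train_rows = rows_for_split(toy_csv, "train")

# Per-class patch counts within the selected split.
label_counts = Counter(row["label"] for row in train_rows)
```

With the real file, the same function can be called on `open("metadata.csv", newline="")`. Note that `csv.DictReader` returns every field as a string, so `wsi_x` and `wsi_y` need an explicit `int(...)` if numeric coordinates are required.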
Each `.tar` file contains ~1 GB of `.pt` files. Please extract them into the `data` folder to use our code.
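A minimal sketch of the extraction step, using only the standard library. The helper name `extract_shards` is hypothetical, and the internal layout of the archives (flat `.pt` files versus nested folders) is an assumption, so the function simply counts every `.pt` file it finds afterwards.

```python
import tarfile
from pathlib import Path


def extract_shards(shard_dir: str, out_dir: str) -> int:
    """Extract every graph-data-*.tar shard in shard_dir into out_dir.

    Returns the number of .pt files found under out_dir afterwards.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    for shard in sorted(Path(shard_dir).glob("graph-data-*.tar")):
        with tarfile.open(shard) as tar:
            tar.extractall(out, **kwargs)
    return sum(1 for _ in out.rglob("*.pt"))


# Example call (commented out so importing this snippet has no side effects):
# num_graphs = extract_shards("data", "data")
```

Since each shard is about 1 GB, extracting shard by shard and deleting each archive afterwards may be preferable when disk space is limited.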
These `.pt` files are PyTorch Geometric `Data` objects with the following attributes:
| Attribute | Shape | Description |
|---|---|---|
| `x` | `[num_nodes, 96]` | Node feature matrix |
| `edge_index` | `[2, num_edges]` | Graph connectivity in COO format |
| `edge_attr` | `[num_edges, 1]` | Edge features |
| `sample_id` | `str` | Unique sample identifier |
| `label` | `str` | Class label |
| `wsi_x` | `str` | x coordinate of the patch (bottom left) |
| `wsi_y` | `str` | y coordinate of the patch (bottom left) |
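Because `label` is stored as a string, most classification losses need it mapped to an integer index first. The sketch below builds that mapping from whatever label strings are observed. The helper `build_label_index` is hypothetical, and the second label string is a placeholder, since this card only shows the stored string for the ductal class.

```python
def build_label_index(labels):
    """Map each distinct label string to an integer index, in sorted order."""
    return {name: idx for idx, name in enumerate(sorted(set(labels)))}


# The first string appears in this card's Quick Start output; the second is a
# placeholder, because the stored lobular label string is not shown here.
observed = [
    "Infiltrating duct carcinoma, NOS",
    "PLACEHOLDER lobular label",
    "Infiltrating duct carcinoma, NOS",
]
label_to_index = build_label_index(observed)

# Integer targets in the same order as the observed labels.
targets = [label_to_index[name] for name in observed]
```

Sorting the distinct labels keeps the mapping stable across runs, as long as the set of observed labels is the same.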
---
## Quick Start
```python
import torch
from torch_geometric.data import Data
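# Load a single graph. weights_only=False is needed because each file stores
# a pickled PyG Data object rather than plain tensors.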
graph = torch.load("data/TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291_x55517_y57392.pt", weights_only=False)
print(graph)
# Data(x=[26, 96], edge_index=[2, 67], edge_attr=[67, 1], sample_id='TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291', label='Infiltrating duct carcinoma, NOS', wsi_x='55517', wsi_y='57392')
print(f"Nodes: {graph.x.shape[0]}, Edges: {graph.edge_index.shape[1]}")
# Nodes: 26, Edges: 67
```
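The file name in the Quick Start example appears to follow the pattern `<sample_id>_x<wsi_x>_y<wsi_y>.pt`. Assuming that pattern holds for all files (it is inferred from this single example, not documented), the patch coordinates can be recovered without loading the graph. The helper `parse_graph_name` is hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from the one file name shown in Quick Start.
_NAME_RE = re.compile(r"^(?P<sample_id>.+)_x(?P<wsi_x>\d+)_y(?P<wsi_y>\d+)$")


def parse_graph_name(path):
    """Split a graph file name into (sample_id, wsi_x, wsi_y) with integer coordinates."""
    match = _NAME_RE.match(Path(path).stem)
    if match is None:
        raise ValueError(f"unexpected graph file name: {path}")
    return match["sample_id"], int(match["wsi_x"]), int(match["wsi_y"])


example = "data/TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291_x55517_y57392.pt"
sample_id, wsi_x, wsi_y = parse_graph_name(example)
```

This can be useful for grouping patches by slide or reconstructing their spatial layout. `metadata.csv` remains the authoritative source for these fields.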
---
## Citation
If you use this dataset, please cite both our work and the original TCGA-BRCA dataset:
**GrapHist (this dataset):**
```bibtex
@misc{ogut2026graphist,
title={GrapHist: Graph Self-Supervised Learning for Histopathology},
author={Sevda Öğüt and Cédric Vincent-Cuaz and Natalia Dubljevic and Carlos Hurtado and Vaishnavi Subramanian and Pascal Frossard and Dorina Thanou},
year={2026},
eprint={2603.00143},
url={https://arxiv.org/abs/2603.00143},
}
```
**TCGA-BRCA (source images):**
```bibtex
@article{weinstein2013cancer,
title={The cancer genome atlas pan-cancer analysis project},
author={Weinstein, John N and Collisson, Eric A and Mills, Gordon B and Shaw, Kenna R and Ozenberger, Brad A and Ellrott, Kyle and Shmulevich, Ilya and Sander, Chris and Stuart, Joshua M},
journal={Nature Genetics},
volume={45},
number={10},
pages={1113--1120},
year={2013},
publisher={Nature Publishing Group}
}
```
---
## License
This dataset is released under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
Provider: ogutsevda



