
ogutsevda/graph-tcga-brca

Hugging Face | Updated 2026-03-03 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ogutsevda/graph-tcga-brca
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- graph-ml
tags:
- histopathology
- graph-classification
- breast-cancer
- pytorch-geometric
pretty_name: Graph-TCGA-BRCA
size_categories:
- 1M<n<10M
---

# Graph-TCGA-BRCA: A Cell-Graph Dataset for Breast Cancer from TCGA-BRCA

<p align="center">
  <img src="preprocessing.png" width="600"/>
</p>

**Graph-TCGA-BRCA** is a graph-level classification dataset derived from the [TCGA-BRCA](https://portal.gdc.cancer.gov/projects/TCGA-BRCA) histopathology dataset. Each 224x224 patch image is converted into a **cell-graph** in which nodes represent detected cell nuclei and edges encode spatial proximity, enabling graph-based learning for fine-grained breast lesion subtyping across 2 clinically relevant classes. Node features describe cell morphology, texture, and color intensity, whereas edge features are Euclidean distances in micrometers.

This dataset is part of the paper [GrapHist: Graph Self-Supervised Learning for Histopathology](https://arxiv.org/pdf/2603.00143).

> ⚠️ **Edge Weight Note**: While the architecture in GrapHist supports both positive and negative edge weights, by default edge features represent Euclidean distances, so farther nodes have larger, positive values. This can be counterintuitive for many graph neural network models. We recommend experimenting with edge weights, such as using their inverse (e.g., `1/distance`) or negative distance (e.g., `-distance`), to better capture proximity and benefit learning.
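The two transformations suggested in the Edge Weight Note can be sketched as follows. This is a minimal illustration, not the GrapHist implementation: the function names and the small `eps` guard against division by zero are assumptions, and the sketch assumes `edge_attr` still holds raw, non-negative distances (that is, no normalization has been applied yet).

```python
import torch


def inverse_distance(edge_attr: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Map each distance d to 1 / (d + eps), so closer nuclei get larger weights.

    `eps` is a hypothetical safeguard against zero distances.
    """
    return 1.0 / (edge_attr + eps)


def negative_distance(edge_attr: torch.Tensor) -> torch.Tensor:
    """Map each distance d to -d, so closer nuclei get larger (less negative) weights."""
    return -edge_attr


# Toy edge features with the documented shape [num_edges, 1] (values are made up).
edge_attr = torch.tensor([[2.0], [4.0], [8.0]])
inv = inverse_distance(edge_attr)
neg = negative_distance(edge_attr)
```

Either result keeps the `[num_edges, 1]` shape, so it could replace `graph.edge_attr` on a loaded graph before the graph is passed to a model.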
## Dataset Summary

| Property | Value |
|---|---|
| **Total graphs** | 11 149 500 |
| **Classes** | 2 |
| **Node feature dim** | 96 |
| **Edge feature dim** | 1 |

## Classes

| Label | Full Name | Count |
|---|---|---|
| `IDC` | Infiltrating Ductal Carcinoma | 794 |
| `LC` | Lobular Carcinoma | 204 |

## Data Structure

```
graph-tcga-brca/
├── README.md
├── metadata.csv         # graph_path, sample_id, wsi_x, wsi_y, label, split
├── normalization.json   # normalizer values for node and edge features computed from train patches
├── preprocessing.png
└── data/
    ├── graph-data-000000.tar
    ├── graph-data-000001.tar
    └── ...
```

Each `.tar` file contains ~1 GB of `.pt` files. Please extract them into the `data` folder to use our code. These `.pt` files are PyTorch Geometric `Data` objects with the following attributes:

| Attribute | Shape | Description |
|---|---|---|
| `x` | `[num_nodes, 96]` | Node feature matrix |
| `edge_index` | `[2, num_edges]` | Graph connectivity in COO format |
| `edge_attr` | `[num_edges, 1]` | Edge features |
| `sample_id` | `str` | Unique sample identifier |
| `label` | `str` | Class label |
| `wsi_x` | `str` | x coordinate of the patch (bottom left) |
| `wsi_y` | `str` | y coordinate of the patch (bottom left) |

---

## Quick Start

```python
import torch
from torch_geometric.data import Data  # class of the objects stored in the .pt files

# Load a single graph. The files are pickled `Data` objects rather than plain
# tensors, hence weights_only=False; only load files from a source you trust.
graph = torch.load(
    "data/TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291_x55517_y57392.pt",
    weights_only=False,
)

print(graph)
# Data(x=[26, 96], edge_index=[2, 67], edge_attr=[67, 1], sample_id='TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291', label='Infiltrating duct carcinoma, NOS', wsi_x='55517', wsi_y='57392')

print(f"Nodes: {graph.x.shape[0]}, Edges: {graph.edge_index.shape[1]}")
# Nodes: 26, Edges: 67
```

---

## Citation

If you use this dataset, please cite both our work and the original TCGA-BRCA dataset:

**GrapHist (this dataset):**

```bibtex
@misc{ogut2026graphist,
  title={GrapHist: Graph Self-Supervised Learning for Histopathology},
  author={Sevda Öğüt and Cédric Vincent-Cuaz and Natalia Dubljevic and Carlos Hurtado and Vaishnavi Subramanian and Pascal Frossard and Dorina Thanou},
  year={2026},
  eprint={2603.00143},
  url={https://arxiv.org/abs/2603.00143},
}
```

**TCGA-BRCA (source images):**

```bibtex
@article{weinstein2013cancer,
  title={The cancer genome atlas pan-cancer analysis project},
  author={Weinstein, John N and Collisson, Eric A and Mills, Gordon B and Shaw, Kenna R and Ozenberger, Brad A and Ellrott, Kyle and Shmulevich, Ilya and Sander, Chris and Stuart, Joshua M},
  journal={Nature Genetics},
  volume={45},
  number={10},
  pages={1113--1120},
  year={2013},
  publisher={Nature Publishing Group}
}
```

---

## License

This dataset is released under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
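
As a minimal sketch of working with the files listed under Data Structure, the helpers below extract the `.tar` shards into `data/` and select the `metadata.csv` rows belonging to one split. The column names come from the comment in the directory listing; the function names, the internal layout of the archives, and any concrete split value such as `"train"` are assumptions that the dataset card does not confirm.

```python
import csv
import tarfile
from collections import Counter
from pathlib import Path


def extract_shards(data_dir):
    """Extract every graph-data-*.tar found in data_dir into data_dir.

    Returns the number of shards processed. How files are laid out inside
    the archives is an assumption; adjust the target folder if needed.
    """
    data_dir = Path(data_dir)
    shards = sorted(data_dir.glob("graph-data-*.tar"))
    for shard in shards:
        with tarfile.open(shard) as tar:
            tar.extractall(data_dir)
    return len(shards)


def rows_for_split(metadata_csv, split):
    """Return the metadata.csv rows (as dicts) whose `split` column equals `split`."""
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f) if row["split"] == split]


def label_counts(rows):
    """Count graphs per class label in a list of metadata rows."""
    return Counter(row["label"] for row in rows)
```

For instance, `label_counts(rows_for_split("metadata.csv", "train"))` would give per-class graph counts for a split named `"train"`, if such a value exists in the file. Each row's `graph_path` can then be passed to `torch.load` as in Quick Start, assuming the path resolves relative to the folder you run from.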
Provider:
ogutsevda