SPIDER-breast
Hosted on the ModelScope community. Updated 2025-11-27; indexed 2025-05-17.
Download link: https://modelscope.cn/datasets/histai/SPIDER-breast
# SPIDER-BREAST Dataset
SPIDER is a collection of supervised pathological datasets covering multiple organs, each with comprehensive class coverage. These datasets are professionally annotated by pathologists.
If you would like to support, sponsor, or obtain a commercial license for the SPIDER data and models, please contact us at models@hist.ai.
For a detailed description of SPIDER, methodology, and benchmark results, refer to our research paper:
📄 **SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models**
[View on arXiv](https://arxiv.org/abs/2503.02876)
This repository contains the **SPIDER-breast** dataset. To explore datasets for other organs, visit the [Hugging Face HistAI page](https://huggingface.co/histai) or [GitHub](https://github.com/HistAI/SPIDER). SPIDER is regularly updated with new organs and data, so follow us on Hugging Face to stay updated.
---
### Overview
SPIDER-breast is a supervised dataset of image-class pairs for the breast organ. Each data point consists of:
- A **central 224×224 patch** with a class label
- **24 surrounding context patches** of the same size, forming a **composite 1120×1120 region**
- Patches are extracted at **20X magnification**
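
The region size follows from tiling: a grid of k × k patches, each 224 pixels wide, gives a square of k × 224 pixels with k² − 1 patches around the centre. A minimal sketch of that arithmetic (the function name `region_geometry` is illustrative and not part of the SPIDER scripts):

```python
PATCH_SIZE = 224  # side length of every patch in pixels, as stated above


def region_geometry(context_size: int) -> tuple[int, int]:
    """Return (side length in pixels, number of surrounding context patches)."""
    if context_size < 1 or context_size % 2 == 0:
        raise ValueError("context_size must be a positive odd number")
    side = context_size * PATCH_SIZE
    surrounding = context_size**2 - 1  # every grid cell except the central patch
    return side, surrounding


# The full region is a 5 x 5 grid; the same formula covers smaller odd grids.
for k in (5, 3, 1):
    side, surrounding = region_geometry(k)
    print(f"grid {k}x{k}: {side}x{side} px, {surrounding} context patches")
```

For the full 5 × 5 grid this reproduces the figures above: 1120×1120 pixels and 24 context patches.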
We provide a **train-test split** for consistent benchmarking. The split is done at the **slide level**, ensuring that patches from the same whole slide image (WSI) do not appear in both training and test sets. Users can also merge and re-split the data as needed.
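
If you re-split the data, it is worth keeping the same slide-level separation. The sketch below groups records by slide before splitting. It assumes each record is a dict with a `"slide"` key, which is a hypothetical field name: the schema of `metadata.json` is not described here, so adapt the key to the actual file.

```python
import random
from collections import defaultdict


def split_by_slide(records, test_fraction=0.2, seed=0):
    """Split records so that no slide contributes to both train and test.

    `records` is an iterable of dicts with a "slide" key (hypothetical field
    name; adapt it to the actual metadata schema).
    """
    by_slide = defaultdict(list)
    for record in records:
        by_slide[record["slide"]].append(record)

    slides = sorted(by_slide)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(slides)
    n_test = max(1, round(len(slides) * test_fraction))
    test_slides = set(slides[:n_test])

    train = [r for s in slides if s not in test_slides for r in by_slide[s]]
    test = [r for s in slides if s in test_slides for r in by_slide[s]]
    return train, test
```

Because whole slides are assigned to one side, the resulting patch-level ratio only approximates `test_fraction` when slides contribute different numbers of patches.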
## How to Use
### Downloading the Dataset
#### Option 1: Using `huggingface_hub`
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="histai/SPIDER-breast", repo_type="dataset", local_dir="/local_path")
```
#### Option 2: Using `git`
```bash
# Ensure you have Git LFS installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/histai/SPIDER-breast
```
### Extracting the Dataset
The dataset is provided as a tar archive split into multiple parts. Concatenate the parts and unpack them with:
```bash
cat spider-breast.tar.* | tar -xvf -
```
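
Where `cat` and `tar` are unavailable (for example on Windows), the same concatenate-and-extract step can be approximated with Python's standard library. This is a sketch, not part of the SPIDER scripts: the `ConcatenatedParts` and `extract_parts` names are made up for illustration, and it assumes that sorting the part names alphabetically gives the correct order.

```python
import glob
import tarfile


class ConcatenatedParts:
    """Minimal read-only file-like object that chains several files in order."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._index = 0
        self._current = None

    def read(self, size=-1):
        chunks = []
        remaining = size
        while size < 0 or remaining > 0:
            if self._current is None:
                if self._index >= len(self._paths):
                    break  # all parts consumed
                self._current = open(self._paths[self._index], "rb")
                self._index += 1
            data = self._current.read(remaining if size >= 0 else -1)
            if not data:  # current part exhausted, move on to the next one
                self.close()
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None


def extract_parts(pattern, dest):
    """Extract a tar archive that was split into parts matching `pattern`."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Use the safer "data" extraction filter where this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    stream = ConcatenatedParts(parts)
    try:
        # "r|" reads the archive as a forward-only stream, like `tar -xf -`.
        with tarfile.open(fileobj=stream, mode="r|") as archive:
            archive.extractall(dest, **kwargs)
    finally:
        stream.close()
```

A call such as `extract_parts("spider-breast.tar.*", ".")` would then mirror the shell command above.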
### Using the Dataset
Once extracted, you will find:
- An `images/` folder
- A `metadata.json` file
You can process and use the dataset in two ways:
#### 1. Directly in Code (Recommended for PyTorch Training)
Use the dataset class provided in `scripts/spider_dataset.py`. This class takes:
- Path to the dataset (folder containing `metadata.json` and `images/` folder)
- Context size: `5`, `3`, or `1`
  - `5`: the full **1120×1120** region (default)
  - `3`: the central **672×672** region
  - `1`: the central **224×224** patch only
The dataset class dynamically returns stitched images, making it suitable for direct use in PyTorch training pipelines.
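
To make the context sizes concrete, suppose the 25 patches of a data point sit on a 5 × 5 grid in row-major order. That ordering is an assumption made for this sketch; the actual layout is defined by `metadata.json` and `scripts/spider_dataset.py`. A smaller context then keeps only the central cells. The hypothetical helper `context_layout` below illustrates the indexing and paste offsets a stitching routine would need.

```python
PATCH_SIZE = 224
FULL_GRID = 5  # the full region is 5 x 5 patches


def context_layout(context_size: int):
    """Map kept grid cells to their pixel offset in the stitched image.

    Returns a list of (grid_index, (x, y)) pairs, where grid_index is the
    row-major position in the assumed 5 x 5 grid and (x, y) is the top-left
    corner of that patch in the stitched output.
    """
    if context_size not in (1, 3, 5):
        raise ValueError("context_size must be 1, 3, or 5")
    margin = (FULL_GRID - context_size) // 2  # rings of patches dropped per side
    layout = []
    for row in range(context_size):
        for col in range(context_size):
            grid_index = (row + margin) * FULL_GRID + (col + margin)
            layout.append((grid_index, (col * PATCH_SIZE, row * PATCH_SIZE)))
    return layout
```

With Pillow, each pair could drive `canvas.paste(patch, (x, y))` on a blank square image of side `context_size * 224`.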
#### 2. Convert to ImageNet Format
To structure the dataset for use with standard tools, convert it with `scripts/convert_to_imagenet.py`. The script supports the same context sizes as the dataset class and writes images to the following layout:
```
<output_dir>/<split>/<class>/<slide>/<image>
```
You can then use it with:
```python
from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
```
or with the `torchvision.datasets.ImageFolder` class.
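
Because the converted layout keeps the slide as a directory level, the split, class, and slide of any image can be recovered from its path, which is useful for slide-level grouping or evaluation. A minimal sketch using only the standard library (`parse_converted_path` and the file names in the example are hypothetical):

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ConvertedPath(NamedTuple):
    split: str
    label: str
    slide: str
    image: str


def parse_converted_path(path: str, output_dir: str) -> ConvertedPath:
    """Split <output_dir>/<split>/<class>/<slide>/<image> into its components."""
    parts = PurePosixPath(path).relative_to(PurePosixPath(output_dir)).parts
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 path components below {output_dir!r}, got {parts}"
        )
    return ConvertedPath(*parts)


# Example with made-up slide and file names:
info = parse_converted_path(
    "/data/out/train/Fat/slide_001/patch_0001.png", "/data/out"
)
print(info.split, info.label, info.slide)
```

The sketch uses POSIX-style paths for predictable behaviour; on Windows, `pathlib.Path` would be the natural substitute.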
---
### Dataset Composition
The SPIDER-breast dataset consists of the following classes:
| Class | Total Patches |
|--------------------------------------|---------------|
| Adenosis | 2899 |
| Benign phyllodes tumor | 4526 |
| Ductal carcinoma in situ (high-grade)| 5632 |
| Ductal carcinoma in situ (low-grade) | 5017 |
| Fat | 6286 |
| Fibroadenoma | 5243 |
| Fibrocystic changes | 5027 |
| Fibrosis | 6260 |
| Invasive non-special type carcinoma | 6142 |
| Lipogranuloma | 4941 |
| Lobular invasive carcinoma | 5102 |
| Malignant phyllodes tumor | 5271 |
| Necrosis | 5396 |
| Normal ducts | 4891 |
| Normal lobules | 5821 |
| Sclerosing adenosis | 3423 |
| Typical ductal hyperplasia | 5546 |
| Vessels | 5469 |
**Total Counts:**
- **92,892** central patches
- **984,924** total patches (including context patches)
- **921** total slides used for annotation
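
The per-class counts in the table add up to the stated number of central patches, and the classes are moderately imbalanced. The check below copies the counts from the table. The inverse-frequency weighting is one common way to handle imbalance and is shown only as an illustration, not as something the SPIDER authors prescribe.

```python
# Central-patch counts per class, copied from the table above.
CLASS_COUNTS = {
    "Adenosis": 2899,
    "Benign phyllodes tumor": 4526,
    "Ductal carcinoma in situ (high-grade)": 5632,
    "Ductal carcinoma in situ (low-grade)": 5017,
    "Fat": 6286,
    "Fibroadenoma": 5243,
    "Fibrocystic changes": 5027,
    "Fibrosis": 6260,
    "Invasive non-special type carcinoma": 6142,
    "Lipogranuloma": 4941,
    "Lobular invasive carcinoma": 5102,
    "Malignant phyllodes tumor": 5271,
    "Necrosis": 5396,
    "Normal ducts": 4891,
    "Normal lobules": 5821,
    "Sclerosing adenosis": 3423,
    "Typical ductal hyperplasia": 5546,
    "Vessels": 5469,
}

total = sum(CLASS_COUNTS.values())
assert total == 92_892  # matches the stated number of central patches

# Inverse-frequency weights: a class with the average count gets weight 1.0.
num_classes = len(CLASS_COUNTS)
class_weights = {
    name: total / (num_classes * count) for name, count in CLASS_COUNTS.items()
}

smallest = min(CLASS_COUNTS, key=CLASS_COUNTS.get)
largest = max(CLASS_COUNTS, key=CLASS_COUNTS.get)
print(f"{num_classes} classes, {total} central patches")
print(f"smallest: {smallest}, largest: {largest}")
```

Weights like these could be passed to a weighted loss function during training; whether that helps depends on the model and the evaluation metric.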
---
## License
The dataset is licensed under **CC BY-NC 4.0** and is for **research use only**.
## Citation
If you use this dataset in your work, please cite:
```bibtex
@misc{nechaev2025spidercomprehensivemultiorgansupervised,
title={SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models},
author={Dmitry Nechaev and Alexey Pchelnikov and Ekaterina Ivanova},
year={2025},
eprint={2503.02876},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2503.02876},
}
```
## Contacts
- **Authors:** Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
- **Email:** dmitry@hist.ai, alex@hist.ai, kate@hist.ai
Provider: maas
Created: 2025-05-15



