Download link:

https://modelscope.cn/datasets/histai/SPIDER-breast



Resource description:

# SPIDER-BREAST Dataset

SPIDER is a collection of supervised pathological datasets covering multiple organs, each with comprehensive class coverage. These datasets are professionally annotated by pathologists.

If you would like to support, sponsor, or obtain a commercial license for the SPIDER data and models, please contact us at models@hist.ai.

For a detailed description of SPIDER, methodology, and benchmark results, refer to our research paper:

📄 **SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models**
[View on arXiv](https://arxiv.org/abs/2503.02876)

This repository contains the **SPIDER-breast** dataset. To explore datasets for other organs, visit the [Hugging Face HistAI page](https://huggingface.co/histai) or [GitHub](https://github.com/HistAI/SPIDER). SPIDER is regularly updated with new organs and data, so follow us on Hugging Face to stay updated.

---

### Overview

SPIDER-breast is a supervised dataset of image-class pairs for the breast organ. Each data point consists of:

- A **central 224×224 patch** with a class label
- **24 surrounding context patches** of the same size, forming a **composite 1120×1120 region**
- Patches are extracted at **20X magnification**

We provide a **train-test split** for consistent benchmarking. The split is done at the **slide level**, ensuring that patches from the same whole slide image (WSI) do not appear in both training and test sets. Users can also merge and re-split the data as needed.

## How to Use

### Downloading the Dataset

#### Option 1: Using `huggingface_hub`

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="histai/SPIDER-breast", repo_type="dataset", local_dir="/local_path")
```

#### Option 2: Using `git`

```bash
# Ensure you have Git LFS installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/histai/SPIDER-breast
```

### Extracting the Dataset

The dataset is provided in multiple tar archives. Unpack them using:

```bash
cat spider-breast.tar.* | tar -xvf -
```

### Using the Dataset

Once extracted, you will find:

- An `images/` folder
- A `metadata.json` file

You can process and use the dataset in two ways:

#### 1. Directly in Code (Recommended for PyTorch Training)

Use the dataset class provided in `scripts/spider_dataset.py`. This class takes:

- Path to the dataset (folder containing `metadata.json` and `images/` folder)
- Context size: `5`, `3`, or `1`
  - `5`: Full **1120×1120** patches (default)
  - `3`: **672×672** patches
  - `1`: Only central patches

The dataset class dynamically returns stitched images, making it suitable for direct use in PyTorch training pipelines.

#### 2. Convert to ImageNet Format

To structure the dataset for easy use with standard tools, convert it using `scripts/convert_to_imagenet.py`. The script also supports different context sizes.

This will generate:

```
<output_dir>/<split>/<class>/<slide>/<image>
```

You can then use it with:

```python
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
```

or `torchvision.datasets.ImageFolder` class

---

### Dataset Composition

The SPIDER-breast dataset consists of the following classes:

| Class                                 | Total Patches |
|---------------------------------------|---------------|
| Adenosis                              | 2899          |
| Benign phyllodes tumor                | 4526          |
| Ductal carcinoma in situ (high-grade) | 5632          |
| Ductal carcinoma in situ (low-grade)  | 5017          |
| Fat                                   | 6286          |
| Fibroadenoma                          | 5243          |
| Fibrocystic changes                   | 5027          |
| Fibrosis                              | 6260          |
| Invasive non-special type carcinoma   | 6142          |
| Lipogranuloma                         | 4941          |
| Lobular invasive carcinoma            | 5102          |
| Malignant phyllodes tumor             | 5271          |
| Necrosis                              | 5396          |
| Normal ducts                          | 4891          |
| Normal lobules                        | 5821          |
| Sclerosing adenosis                   | 3423          |
| Typical ductal hyperplasia            | 5546          |
| Vessels                               | 5469          |

**Total Counts:**

- **92,892** central patches
- **984,924** total patches (including context patches)
- **921** total slides used for annotation

---

## License

The dataset is licensed under **CC BY-NC 4.0** and is for **research use only**.

## Citation

If you use this dataset in your work, please cite:

```bibtex
@misc{nechaev2025spidercomprehensivemultiorgansupervised,
  title={SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models},
  author={Dmitry Nechaev and Alexey Pchelnikov and Ekaterina Ivanova},
  year={2025},
  eprint={2503.02876},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2503.02876},
}
```

## Contacts

- **Authors:** Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
- **Email:** dmitry@hist.ai, alex@hist.ai, kate@hist.ai
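
### Appendix: patch geometry sketch

The context sizes listed under "Using the Dataset" follow from simple grid arithmetic: a context size of `k` corresponds to a `k × k` grid of 224-pixel patches. The sketch below only illustrates that arithmetic. The function names are hypothetical and are not part of `scripts/spider_dataset.py`, and it assumes that a smaller context is the centered sub-grid of the full 5×5 composite, which the README implies but does not state outright.

```python
PATCH_SIZE = 224  # side of one patch in pixels, as stated in the Overview


def composite_side(context_size: int, patch_size: int = PATCH_SIZE) -> int:
    """Side length in pixels of a context_size x context_size grid of patches."""
    if context_size not in (1, 3, 5):
        raise ValueError("context size must be 1, 3, or 5")
    return context_size * patch_size


def context_patch_count(context_size: int) -> int:
    """Number of surrounding patches, excluding the central one."""
    return context_size * context_size - 1


def center_crop_box(context_size: int, full_context: int = 5,
                    patch_size: int = PATCH_SIZE) -> tuple:
    """(left, upper, right, lower) box of a smaller context inside the full composite.

    Assumption: the smaller context is centered on the labeled patch.
    """
    offset = (full_context - context_size) // 2 * patch_size
    side = composite_side(context_size, patch_size)
    return (offset, offset, offset + side, offset + side)


if __name__ == "__main__":
    for k in (5, 3, 1):
        print(k, composite_side(k), context_patch_count(k), center_crop_box(k))
```

The box uses the `(left, upper, right, lower)` convention accepted by Pillow's `Image.crop`, so under the centering assumption a 3×3 context could be cut out of a stitched 1120×1120 image without re-reading individual patches.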
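
### Appendix: class balance sketch

The per-class counts in "Dataset Composition" are uneven (Adenosis has roughly half as many patches as Fat), which can matter when training a classifier. The sketch below copies the counts from the table, checks that they add up to the stated 92,892 central patches, and derives inverse-frequency class weights. The weighting scheme is a common generic technique used here for illustration only; it is not something the dataset authors prescribe, and `inverse_frequency_weights` is a hypothetical helper name.

```python
# Per-class central-patch counts, copied from the Dataset Composition table.
CLASS_COUNTS = {
    "Adenosis": 2899,
    "Benign phyllodes tumor": 4526,
    "Ductal carcinoma in situ (high-grade)": 5632,
    "Ductal carcinoma in situ (low-grade)": 5017,
    "Fat": 6286,
    "Fibroadenoma": 5243,
    "Fibrocystic changes": 5027,
    "Fibrosis": 6260,
    "Invasive non-special type carcinoma": 6142,
    "Lipogranuloma": 4941,
    "Lobular invasive carcinoma": 5102,
    "Malignant phyllodes tumor": 5271,
    "Necrosis": 5396,
    "Normal ducts": 4891,
    "Normal lobules": 5821,
    "Sclerosing adenosis": 3423,
    "Typical ductal hyperplasia": 5546,
    "Vessels": 5469,
}


def inverse_frequency_weights(counts: dict) -> dict:
    """Weight each class by total / (num_classes * count).

    A perfectly balanced dataset would give every class a weight of 1.0.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


total = sum(CLASS_COUNTS.values())
weights = inverse_frequency_weights(CLASS_COUNTS)
rarest = min(CLASS_COUNTS, key=CLASS_COUNTS.get)

if __name__ == "__main__":
    print(f"classes: {len(CLASS_COUNTS)}, central patches: {total}")
    print(f"rarest class: {rarest}, weight: {weights[rarest]:.3f}")
```

Weights of this kind could be passed, in class-index order, to a weighted loss; whether that helps on this dataset is an empirical question the README does not address.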


Use cases: