
SPIDER-colorectal

ModelScope community. Updated 2025-11-27. Indexed 2025-05-17.
Download link:
https://modelscope.cn/datasets/histai/SPIDER-colorectal
Resource description:
# SPIDER-COLORECTAL Dataset

SPIDER is a collection of supervised pathological datasets covering multiple organs, each with comprehensive class coverage. These datasets are professionally annotated by pathologists.

If you would like to support, sponsor, or obtain a commercial license for the SPIDER data and models, please contact us at models@hist.ai.

For a detailed description of SPIDER, methodology, and benchmark results, refer to our research paper:

📄 **SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models**
[View on arXiv](https://arxiv.org/abs/2503.02876)

This repository contains the **SPIDER-colorectal** dataset. To explore datasets for other organs, visit the [Hugging Face HistAI page](https://huggingface.co/histai) or [GitHub](https://github.com/HistAI/SPIDER). SPIDER is regularly updated with new organs and data, so follow us on Hugging Face to stay updated.

---

### Overview

SPIDER-colorectal is a supervised dataset of image-class pairs for the colorectal organ. Each data point consists of:

- A **central 224×224 patch** with a class label
- **24 surrounding context patches** of the same size, forming a **composite 1120×1120 region**
- Patches are extracted at **20X magnification**

We provide a **train-test split** for consistent benchmarking. The split is done at the **slide level**, ensuring that patches from the same whole slide image (WSI) do not appear in both training and test sets. Users can also merge and re-split the data as needed.
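If you re-split the data yourself, keeping the split at the slide level avoids leaking patches from one WSI into both sets. The following is a minimal sketch of one way to do that. It assumes each record is a dict that carries a slide identifier under a key named `slide`. The record layout, the key name, and the `split_by_slide` helper are hypothetical and may not match the actual structure of `metadata.json`.

```python
import random
from collections import defaultdict


def split_by_slide(records, test_fraction=0.2, seed=0, slide_key="slide"):
    """Split records into train/test so that no slide appears in both.

    `records` is a list of dicts and `slide_key` names the field holding
    the slide identifier. Both are assumptions for illustration only.
    """
    # Group records by the slide they were extracted from.
    by_slide = defaultdict(list)
    for record in records:
        by_slide[record[slide_key]].append(record)

    # Sort first so the shuffle is reproducible for a given seed.
    slides = sorted(by_slide)
    random.Random(seed).shuffle(slides)

    # Assign whole slides (never individual patches) to the test set.
    n_test = max(1, round(len(slides) * test_fraction))
    test_slides = set(slides[:n_test])

    train = [r for s in slides if s not in test_slides for r in by_slide[s]]
    test = [r for s in slides if s in test_slides for r in by_slide[s]]
    return train, test


# Toy records standing in for real metadata entries (10 slides, 5 patches each).
records = [{"slide": f"slide_{i % 10}", "id": i} for i in range(50)]
train, test = split_by_slide(records, test_fraction=0.2, seed=0)
```

Because whole slides are assigned to one side or the other, the resulting patch-level ratio only approximates `test_fraction` when slides contribute different numbers of patches.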
## How to Use

### Downloading the Dataset

#### Option 1: Using `huggingface_hub`

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="histai/SPIDER-colorectal", repo_type="dataset", local_dir="/local_path")
```

#### Option 2: Using `git`

```bash
# Ensure you have Git LFS installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/histai/SPIDER-colorectal
```

### Extracting the Dataset

The dataset is provided in multiple tar archives. Unpack them using:

```bash
cat spider-colorectal.tar.* | tar -xvf -
```

### Using the Dataset

Once extracted, you will find:

- An `images/` folder
- A `metadata.json` file

You can process and use the dataset in two ways:

#### 1. Directly in Code (Recommended for PyTorch Training)

Use the dataset class provided in `scripts/spider_dataset.py`. This class takes:

- Path to the dataset (folder containing `metadata.json` and `images/` folder)
- Context size: `5`, `3`, or `1`
  - `5`: Full **1120×1120** patches (default)
  - `3`: **672×672** patches
  - `1`: Only central patches

The dataset class dynamically returns stitched images, making it suitable for direct use in PyTorch training pipelines.

#### 2. Convert to ImageNet Format

To structure the dataset for easy use with standard tools, convert it using `scripts/convert_to_imagenet.py`. The script also supports different context sizes.
This will generate:

```
<output_dir>/<split>/<class>/<slide>/<image>
```

You can then use it with:

```python
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
```

or with the `torchvision.datasets.ImageFolder` class.

---

### Dataset Composition

The SPIDER-colorectal dataset consists of the following classes:

| Class                          | Central Patches |
|--------------------------------|-----------------|
| Adenocarcinoma high grade      | 6299            |
| Adenocarcinoma low grade       | 6066            |
| Adenoma high grade             | 5493            |
| Adenoma low grade              | 5693            |
| Fat                            | 6081            |
| Hyperplastic polyp             | 5893            |
| Inflammation                   | 5523            |
| Mucus                          | 5711            |
| Muscle                         | 5866            |
| Necrosis                       | 5481            |
| Sessile serrated lesion        | 4993            |
| Stroma healthy                 | 8001            |
| Vessels                        | 6082            |

**Total Counts:**

- **77,182** central patches
- **1,039,150** total patches (including context patches)
- **1,719** total slides used for annotation

---

## License

The dataset is licensed under **CC BY-NC 4.0** and is for **research use only**.

## Citation

If you use this dataset in your work, please cite:

```bibtex
@misc{nechaev2025spidercomprehensivemultiorgansupervised,
  title={SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models},
  author={Dmitry Nechaev and Alexey Pchelnikov and Ekaterina Ivanova},
  year={2025},
  eprint={2503.02876},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2503.02876},
}
```

## Contacts

- **Authors:** Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
- **Email:** dmitry@hist.ai, alex@hist.ai, kate@hist.ai

Provider:
maas
Created:
2025-05-15