ACROBAT - a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology
Source: DataCite Commons. Updated 2026-03-23; indexed 2024-07-13.
Download link:
https://researchdata.se/catalogue/dataset/2022-190-1/1
Description:
The ACROBAT data set consists of 4,212 whole slide images (WSIs) from 1,153 female primary breast cancer patients. The WSIs in the data set are available at 10X magnification and show tissue sections from breast cancer resection specimens stained with hematoxylin and eosin (H&E) or immunohistochemistry (IHC). For each patient, one WSI of H&E stained tissue and at least one and up to four WSIs of corresponding tissue stained with the routine diagnostic stains ER, PGR, HER2 and KI67 are available.
The data set was acquired as part of the CHIME study (chimestudy.se) and its primary purpose was to facilitate the ACROBAT WSI registration challenge (acrobat.grand-challenge.org). The histopathology slides originate from routine diagnostic pathology workflows and were digitised for research purposes at Karolinska Institutet (Stockholm, Sweden). The image acquisition process resembles the routine digital pathology image digitisation workflow, using three different Hamamatsu WSI scanners, specifically one NanoZoomer S360 and two NanoZoomer XR.
The WSIs in this data set are accompanied by a data table with one row for each WSI, specifying an anonymised patient ID, the stain or IHC antibody type of each WSI, as well as the magnification and microns per pixel at each available resolution level. Automated registration algorithm performance evaluation is possible through the ACROBAT challenge website based on over 37,000 landmark pair annotations from 13 annotators. While the primary purpose of this data set was the development and evaluation of WSI registration methods, this data set has the potential to facilitate further research in the context of computational pathology, for example in the areas of stain-guided learning, virtual staining, unsupervised learning and stain-independent models.
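Because the data table has one row per WSI with a patient ID and a stain, it can be used to pair each IHC WSI with the H&E WSI of the same patient, which is the pairing a registration method would operate on. The following is a minimal sketch under stated assumptions: the column names (patient_id, stain, filename), the stain label HE and the sample rows are hypothetical and should be checked against the actual table shipped with the data set.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of the metadata table. The real column names,
# stain labels and file naming scheme may differ.
SAMPLE_TABLE = """patient_id,stain,filename
17,HE,17_HE.tiff
17,ER,17_ER.tiff
17,KI67,17_KI67.tiff
42,HE,42_HE.tiff
42,HER2,42_HER2.tiff
"""


def build_registration_pairs(table_file):
    """Return (ihc_filename, he_filename) pairs, one per IHC WSI."""
    by_patient = defaultdict(dict)
    for row in csv.DictReader(table_file):
        by_patient[row["patient_id"]][row["stain"]] = row["filename"]

    pairs = []
    for stains in by_patient.values():
        he = stains.get("HE")
        if he is None:
            continue  # no H&E reference for this patient, so nothing to pair
        for stain, filename in stains.items():
            if stain != "HE":
                pairs.append((filename, he))
    return pairs


pairs = build_registration_pairs(io.StringIO(SAMPLE_TABLE))
for ihc, he in pairs:
    print(f"{ihc} -> {he}")
```

To run this against the real table, pass an open file handle instead of the in-memory sample and adjust the column names as needed.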
The data set consists of three subsets (training, validation and test), following the split used in the ACROBAT WSI registration challenge. The training set contains 750 cases, each with one H&E WSI and one to four IHC WSIs, for 3,406 WSIs in total. The validation set consists of 100 cases with 200 WSIs in total and the test set of 303 cases with 606 WSIs in total. For each case in the validation and test sets, one H&E WSI and one randomly selected IHC WSI are available.
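The subset counts can be checked against the overall totals stated in the description (1,153 patients and 4,212 WSIs). The short sketch below uses only the figures given above; the variable names are illustrative.

```python
# (cases, WSIs) per subset, as stated in the description.
subsets = {
    "training": (750, 3406),
    "validation": (100, 200),
    "test": (303, 606),
}

total_cases = sum(cases for cases, _ in subsets.values())
total_wsis = sum(wsis for _, wsis in subsets.values())

# The subsets should add up to the totals quoted for the whole data set.
assert total_cases == 1153
assert total_wsis == 4212

for name, (cases, wsis) in subsets.items():
    print(f"{name}: {cases} cases, {wsis} WSIs, {wsis / cases:.2f} WSIs per case")
```

The validation and test sets have exactly two WSIs per case, while the training set averages roughly 4.5 WSIs per case.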
WSIs were anonymised by deleting the associated macro images, by generating filenames with random case IDs and by overwriting metadata fields containing potentially personal information. Hamamatsu NDPI files were then converted using libvips (libvips.org/). WSIs are available as generic tiled TIFF WSIs (openslide.org/formats/generic-tiff/) at 10X magnification and lower image levels.
The data set is available for download in seven separate ZIP archives: five for the training data (train_part1.zip (71.47 GB), train_part2.zip (70.59 GB), train_part3.zip (75.91 GB), train_part4.zip (71.63 GB) and train_part5.zip (69.09 GB)), one for the validation data (valid.zip (21.79 GB)) and one for the test data (test.zip (68.11 GB)).
File listings and checksums in SHA1 format are available for checking archive/data integrity when downloading.
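Since the archives are tens of gigabytes each, it is worth verifying them in a streaming fashion rather than loading them into memory. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and the exact layout of the published checksum listing is not described here, so the expected digest is passed in as a plain hexadecimal string.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA1 matches the expected hex digest."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For example, verify("valid.zip", expected) would return True only if the downloaded archive matches the published checksum, where expected is the SHA1 string copied from the listing for that file.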
While it would be helpful to notify SND of any publications using this data set by sending an email to request@snd.gu.se, please note that this is not required to use the data.
Provider:
Karolinska Institutet
Created:
2023-01-02
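Collected summary
Data set introduction

Background and challenges
Background overview
ACROBAT is a multi-stain breast cancer histological whole-slide-image data set for computational pathology. It contains 4,212 images from 1,153 patients, stained with H&E or IHC, and is intended to support the WSI registration challenge and digital pathology research such as stain-guided learning and virtual staining. The data set is divided into training, validation and test subsets, with a total size of approximately 448.6 GiB. It derives from clinical diagnostic samples collected in the Stockholm region between 2012 and 2018, is anonymised, and is released under an open license.
The summary above was collected and generated by 遇见数据集.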



