
CPIA Dataset_Part08: A Comprehensive Pathological Image Analysis Dataset for Self-supervised Learning Pre-training

Source: Science Data Bank. Updated: 2024-04-16. Indexed: 2026-04-23.
Download link:
https://www.scidb.cn/detail?dataSetId=bf97c631e5034cde848466408ec42ae0
Description:
Pathological image analysis is a crucial field in computer-aided diagnosis. Transfer learning with models pre-trained on natural images has improved performance on downstream pathology tasks. However, the lack of sophisticated, domain-specific pathological initialization limits their potential. Self-supervised learning (SSL) enables pre-training without sample-level labels, overcoming the challenge of expensive annotation. The field therefore calls for a comprehensive dataset, similar to ImageNet in computer vision. This paper presents a large-scale comprehensive pathological image analysis (CPIA) dataset for SSL pre-training. The CPIA dataset contains 148,962,579 images, covering over 48 organs/tissues and approximately 100 kinds of diseases, and includes two main data types: whole slide images (WSIs) and characteristic regions of interest (ROIs). We also establish a multi-scale pathological data processing workflow informed by the diagnostic habits of senior pathologists. The CPIA dataset facilitates a comprehensive pathological understanding and enables exploratory pattern discovery. Additionally, to launch the CPIA dataset, several state-of-the-art (SOTA) SSL pre-training baselines and downstream evaluations are conducted. This is Part08 of the CPIA dataset, which includes CPIA-Mini and part of the full CPIA dataset. The related code and information are available at https://github.com/zhanglab2021/CPIA_Dataset.

Provided by:
Yunlu Feng; Guanglei Zhang; Tianyi Zhang; Peking Union Medical College Hospital; Zeyu Liu; Yanli Lei; Sicheng Chen; Shangqing Lyu; Beihang University
Created:
2024-02-29