
HistoSet-5×14: A Collection of Balanced Multi-Organ Histopathology Datasets

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/nc8k63t7mp
Description:
HistoSet-5×14 is a secondary, derived histopathological image collection created by aggregating, standardizing, and rebalancing images from multiple publicly available, peer-reviewed histopathology datasets. The collection covers five organs (breast, colon, lung, oral cavity, and ovary) and comprises fourteen cancerous and non-cancerous tissue classes. It is intended to support multi-class, multi-organ machine learning research in computational pathology.

The images in HistoSet-5×14 originate from four established sources: the Lung and Colon Cancer Histopathological Image Dataset (LC25000) introduced by Borkowski et al. (2019), the Breast Cancer Histopathological Image Classification dataset by Spanhol et al. (2015), the Oral Cancer Histopathological Imaging Database by Rahman et al. (2020), and the Ovarian Cancer Histopathology dataset proposed by Kasture et al. (2021). All source datasets consist of Hematoxylin and Eosin (H&E) stained histopathological images that are de-identified and publicly available for research use.

Because the original datasets exhibit substantial class imbalance and heterogeneous sample sizes, HistoSet-5×14 applies a standardized preprocessing pipeline: controlled data augmentation for under-represented classes and down-sampling for over-represented classes. Each class was normalized to 2,000 images, yielding a total of 28,000 images (14 classes × 2,000). Augmentation was performed conservatively to preserve diagnostically relevant morphological patterns while improving class balance and model robustness.

Source Dataset References:
1. Borkowski, A.A., et al. (2019). Lung and colon cancer histopathological image dataset (LC25000). arXiv:1912.12142.
2. Spanhol, F.A., et al. (2015). A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering, 63(7), 1455–1462.
3. Rahman, T.Y., et al. (2020). Histopathological imaging database for oral cancer analysis. Data in Brief, 29, 105114.
4. Kasture, K.R., et al. (2021). A new deep learning method for automatic ovarian cancer prediction & subtype classification. Turkish Journal of Computer and Mathematics Education, 12(12), 1233–1242.
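
The rebalancing step described above (down-sampling over-represented classes and augmenting under-represented ones to a fixed 2,000 images per class) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `balance_class`, the fixed seed, and the choice to pick augmentation sources uniformly at random are all hypothetical, and the actual augmentation transforms are not specified in the description, so they are left out.

```python
import random

# Per-class size stated in the description; 14 classes x 2,000 = 28,000 images.
TARGET_PER_CLASS = 2000


def balance_class(items, target=TARGET_PER_CLASS, seed=0):
    """Split one class into (kept_originals, augmentation_sources).

    Hypothetical helper: len(kept) + len(sources) always equals `target`.
    Each entry in `sources` is an original image that would be passed through
    some conservative augmentation (not specified in the source description).
    """
    if not items:
        raise ValueError("cannot balance an empty class")
    rng = random.Random(seed)
    if len(items) >= target:
        # Over-represented class: down-sample without replacement.
        return rng.sample(items, target), []
    # Under-represented class: keep every original and choose, with
    # replacement, which originals to augment to cover the deficit.
    deficit = target - len(items)
    return list(items), [rng.choice(items) for _ in range(deficit)]
```

For example, a class with 5,000 source images would yield 2,000 kept originals and no augmentation sources, while a class with 600 images would keep all 600 and select 1,400 originals to augment.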

Created: 2025-12-15