
Balanced and Augmented Version of the HAM10000 Skin Lesion Dataset (Derived & Corrected)

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/hpcf9psdy7
Resource description:
Overview

The HAM10000 ("Human Against Machine with 10,000 training images") dataset is one of the most widely used collections for skin lesion classification. It contains 10,015 dermatoscopic images in seven diagnostic classes: Melanocytic Nevi (NV), Melanoma (MEL), Benign Keratosis (BKL), Basal Cell Carcinoma (BCC), Actinic Keratoses (AKIEC), Vascular Lesions (VASC), and Dermatofibroma (DF).

A major challenge in the original HAM10000 dataset is its highly imbalanced class distribution. The NV class alone makes up about 67% of all samples, while the minority classes DF and VASC together represent less than 3%. This imbalance leads to biased models that perform well on common classes but poorly on rare lesions. Many researchers have tried to fix this by heavily upsampling small classes (for example, expanding a 150-image class to over 1,000 samples), which often produces overfitted models and an unrealistic data distribution.

This derived version was created to offer a balanced and scientifically responsible alternative. It combines undersampling of large classes with controlled augmentation of small ones. A target of roughly 500–650 training samples per class was chosen to keep the classes comparable in size while preserving data diversity. Larger classes such as NV, MEL, and BKL were undersampled to around 500 samples each to prevent majority dominance. Smaller classes such as AKIEC, DF, and VASC were augmented with realistic transformations: random rotations (±30°), horizontal/vertical flips, scaling (0.8–1.2×), brightness and contrast adjustments (±20%), and mild Gaussian noise. This kept any class from being artificially inflated or distorted.
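The undersample-or-augment strategy described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the dataset authors' code: the function names, the 0.5 flip probability, the noise level, the fixed seed, and the two example class sizes are all hypothetical, and the ±20% adjustments are read here as multiplicative factors. Only the parameter ranges come from the description.

```python
import random

def plan_class(n_available, target):
    """Split a per-class training target into real and augmented image counts."""
    n_real = min(n_available, target)     # undersample if the class exceeds the target
    n_aug = max(0, target - n_available)  # augment only to cover a shortfall
    return n_real, n_aug

def sample_augmentation(rng):
    """Draw one set of transform parameters from the ranges quoted above."""
    return {
        "rotation_deg": rng.uniform(-30.0, 30.0),
        "hflip": rng.random() < 0.5,          # flip probability is an assumption
        "vflip": rng.random() < 0.5,
        "scale": rng.uniform(0.8, 1.2),
        "brightness": rng.uniform(0.8, 1.2),  # +/-20% read as a multiplicative factor
        "contrast": rng.uniform(0.8, 1.2),
        "noise_sigma": 0.01,                  # "mild" noise level is an assumption
    }

def apply_photometric(pixels, params, rng):
    """Apply contrast, brightness, and Gaussian noise to intensities in [0, 1]."""
    out = []
    for p in pixels:
        v = (p - 0.5) * params["contrast"] + 0.5       # contrast around mid-grey
        v = v * params["brightness"]                   # brightness scaling
        v = v + rng.gauss(0.0, params["noise_sigma"])  # additive Gaussian noise
        out.append(min(1.0, max(0.0, v)))              # clamp back into range
    return out

rng = random.Random(0)          # fixed seed so the draw is repeatable
large = plan_class(5000, 500)   # hypothetical majority class
small = plan_class(150, 550)    # hypothetical minority class
params = sample_augmentation(rng)
augmented = apply_photometric([0.1, 0.5, 0.9], params, rng)
print(large, small)             # (500, 0) (150, 400)
```

The sketch only samples the geometric parameters (rotation, flips, scaling); applying them to real images would need an image library. The point of `plan_class` is that augmentation is used solely to cover the gap between the real images available and the target, so a class never receives more synthetic images than that gap.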
The final dataset structure is as follows:

AKIEC: Train 654, Test 150
BCC: Train 500, Test 150
BKL: Train 500, Test 150
DF: Train 537, Test 115
MEL: Train 500, Test 150
NV: Train 500, Test 150
VASC: Train 568, Test 142

These numbers give a nearly uniform dataset without losing important image diversity. The slight variations between class sizes (500–650) are intentional and defensible: they preserve genuine data from minority classes while preventing excessive synthetic augmentation. Forcing all classes to an identical count could remove valuable real samples or produce too many artificial images, either of which would reduce model generalization.

This version provides a fair, realistic, and reproducible dataset for training skin lesion classification models. It is intended to reduce overfitting, improve class-level balance, and support better generalization. Researchers can use it to evaluate fairness and robustness in medical image classification tasks.
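The per-class counts listed above can be checked with a short script. The sketch below simply totals the published numbers and computes the ratio between the largest and smallest training class; the variable names are illustrative, and the max/min ratio is one common way to summarize balance, not a metric the description itself reports.

```python
# (train, test) counts per class, copied from the structure listed above
counts = {
    "AKIEC": (654, 150),
    "BCC": (500, 150),
    "BKL": (500, 150),
    "DF": (537, 115),
    "MEL": (500, 150),
    "NV": (500, 150),
    "VASC": (568, 142),
}

train_total = sum(train for train, _ in counts.values())
test_total = sum(test for _, test in counts.values())

# Ratio of the largest to the smallest training class (1.0 = perfectly balanced)
train_sizes = [train for train, _ in counts.values()]
imbalance_ratio = max(train_sizes) / min(train_sizes)

print(train_total, test_total, round(imbalance_ratio, 3))  # 3759 1007 1.308
```

A ratio of about 1.3 means the largest training class (AKIEC, 654 images) is roughly 30% bigger than the smallest ones (500 images), which matches the "nearly uniform" description.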

Created:
2025-10-27