Balanced and Augmented Version of the HAM10000 Skin Lesion Dataset (Derived & Corrected)
Source: Mendeley Data. Indexed: 2026-04-18
Download link:
https://data.mendeley.com/datasets/hpcf9psdy7
Description:
Overview
The HAM10000 (“Human Against Machine with 10,000 training images”) dataset is one of the most widely used collections for skin lesion classification. It contains 10,015 dermatoscopic images categorized into seven diagnostic classes: Melanocytic Nevi (NV), Melanoma (MEL), Benign Keratosis (BKL), Basal Cell Carcinoma (BCC), Actinic Keratoses (AKIEC), Vascular Lesions (VASC), and Dermatofibroma (DF).
One of the major challenges in the original HAM10000 dataset is its highly imbalanced class distribution. The NV class alone makes up about 67% of all samples, while the minority classes DF and VASC together represent less than 3%. This imbalance tends to bias models toward the common classes, so they perform well on frequent lesions but poorly on rare ones. Many researchers have tried to fix this by heavily upsampling the small classes, for example by increasing a 150-image class to over 1,000 samples, which often causes overfitting and produces an unrealistic training distribution.
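The proportions quoted above can be checked with a few lines of arithmetic. The sketch below assumes the per-class counts commonly reported for the original HAM10000 release. These counts are not given in this description and are included only for illustration.

```python
# Per-class image counts commonly reported for the original HAM10000 release.
# Assumption: these figures are not stated in this description.
ORIGINAL_COUNTS = {
    "NV": 6705, "MEL": 1113, "BKL": 1099, "BCC": 514,
    "AKIEC": 327, "VASC": 142, "DF": 115,
}

total = sum(ORIGINAL_COUNTS.values())
nv_share = ORIGINAL_COUNTS["NV"] / total
minority_share = (ORIGINAL_COUNTS["DF"] + ORIGINAL_COUNTS["VASC"]) / total

print(f"total images: {total}")
print(f"NV share: {nv_share:.1%}")
print(f"DF + VASC share: {minority_share:.1%}")
```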
This derived version was created to offer a balanced and scientifically responsible alternative. It uses a combination of undersampling for large classes and controlled augmentation for small ones. A target of roughly 500–650 training samples per class was selected to maintain fairness while preserving data diversity.
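The balancing idea can be sketched as follows. This is a minimal illustration, not the dataset author's actual procedure: the helper names `undersample` and `augmentations_needed`, the seed, and the example class sizes are all hypothetical.

```python
import random

def undersample(image_ids, target, seed=0):
    """Return a reproducible random subset of at most `target` image IDs."""
    ids = sorted(image_ids)  # fixed order so the seed fully determines the result
    if len(ids) <= target:
        return ids
    return random.Random(seed).sample(ids, target)

def augmentations_needed(n_real, target):
    """Number of synthetic images needed to lift a class up to `target`."""
    return max(0, target - n_real)

# Hypothetical example: a large class with 6,000 images and a small one with 120.
large_class = [f"img_{i:05d}" for i in range(6000)]
kept = undersample(large_class, target=500)
print(len(kept))                       # 500
print(augmentations_needed(120, 500))  # 380
```

Fixing the seed keeps the selected subset stable across runs, which matters if the split is meant to be reproducible.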
Larger classes such as NV, MEL, and BKL were undersampled to around 500 samples each to prevent majority dominance. Smaller classes like AKIEC, DF, and VASC were augmented carefully using realistic transformations such as random rotations (±30°), horizontal/vertical flips, scaling (0.8–1.2×), brightness and contrast adjustments (±20%), and mild Gaussian noise. The aim was to keep any class from being artificially inflated or distorted.
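A transformation pass of this kind could look roughly like the NumPy sketch below. The parameter ranges follow the description above. The function name `augment`, the noise standard deviation of 5 intensity levels, the nearest-neighbour resampling, and the edge-replication border handling are assumptions made for illustration. The tooling actually used to build the dataset is not stated.

```python
import numpy as np

def augment(image, rng, noise_std=5.0):
    """Apply one random augmentation pass to an HxWxC uint8 image."""
    img = image.astype(np.float32)
    h, w = img.shape[:2]

    # Random horizontal and vertical flips.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    if rng.random() < 0.5:
        img = img[::-1, :]

    # Rotation (within +/-30 degrees) and scaling (0.8x to 1.2x) about the
    # image centre, via inverse mapping with nearest-neighbour sampling.
    angle = np.deg2rad(rng.uniform(-30.0, 30.0))
    scale = rng.uniform(0.8, 1.2)
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    dy, dx = ys - cy, xs - cx
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    src_x = (cos_a * dx + sin_a * dy) / scale + cx
    src_y = (-sin_a * dx + cos_a * dy) / scale + cy
    # Out-of-range coordinates are clipped, which replicates edge pixels.
    src_x = np.clip(np.rint(src_x), 0, w - 1).astype(np.intp)
    src_y = np.clip(np.rint(src_y), 0, h - 1).astype(np.intp)
    img = img[src_y, src_x]

    # Contrast and brightness changes of up to +/-20%.
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(0.8, 1.2)
    mean = img.mean()
    img = ((img - mean) * contrast + mean) * brightness

    # Mild additive Gaussian noise (noise_std is an assumed value).
    img = img + rng.normal(0.0, noise_std, size=img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
example = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)  # stand-in image
augmented = augment(example, rng)
print(augmented.shape, augmented.dtype)
```

In practice a library such as torchvision or Albumentations would usually replace the hand-written geometry. The version here only makes each listed transformation explicit.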
The final dataset structure is as follows:
AKIEC – Train: 654, Test: 150
BCC – Train: 500, Test: 150
BKL – Train: 500, Test: 150
DF – Train: 537, Test: 115
MEL – Train: 500, Test: 150
NV – Train: 500, Test: 150
VASC – Train: 568, Test: 142
These numbers yield a nearly uniform class distribution without losing important image diversity. The slight variations between training-set sizes (roughly 500–650) are intentional and defensible. They help preserve genuine data from minority classes while preventing excessive synthetic augmentation. Forcing all classes to have an identical count could remove valuable real samples or produce too many artificial images, which could hurt model generalization.
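As a quick check on the split sizes listed above, the totals and the largest-to-smallest training-class ratio can be computed directly. The counts come from the list above. The variable names are illustrative.

```python
# (train, test) image counts per class, as listed in the description.
SPLITS = {
    "AKIEC": (654, 150),
    "BCC": (500, 150),
    "BKL": (500, 150),
    "DF": (537, 115),
    "MEL": (500, 150),
    "NV": (500, 150),
    "VASC": (568, 142),
}

train_sizes = [train for train, _ in SPLITS.values()]
train_total = sum(train_sizes)
test_total = sum(test for _, test in SPLITS.values())
imbalance_ratio = max(train_sizes) / min(train_sizes)

print(train_total, test_total)    # 3759 1007
print(round(imbalance_ratio, 2))  # 1.31
```

The largest training class is about 1.3 times the size of the smallest.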
This version is intended to provide a fair, realistic, and reproducible dataset for training skin lesion classification models. It is designed to reduce overfitting, improve class-level balance, and support better generalization. Researchers can use this dataset to evaluate fairness and robustness in medical image classification tasks.
Created:
2025-10-27



