The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

Name: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions
Creator: Harvard Dataverse
Published: 2025-02-06 08:55:42
License: Not specified

DataCite Commons. Updated: 2025-02-06. Indexed: 2025-04-15.

Download link:

https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/DBW86T

Description:

Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images, which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma / Bowen's disease (<code>akiec</code>), basal cell carcinoma (<code>bcc</code>), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, <code>bkl</code>), dermatofibroma (<code>df</code>), melanoma (<code>mel</code>), melanocytic nevi (<code>nv</code>) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, <code>vasc</code>). More than 50% of lesions are confirmed through histopathology (<code>histo</code>); the ground truth for the remaining cases is either follow-up examination (<code>follow_up</code>), expert consensus (<code>consensus</code>), or confirmation by in-vivo confocal microscopy (<code>confocal</code>). The dataset includes lesions with multiple images, which can be tracked via the <code>lesion_id</code> column in the HAM10000_metadata file. Due to upload size limitations, images are stored in two files: <ul> <li>HAM10000_images_part1.zip (5000 JPEG files)</li> <li>HAM10000_images_part2.zip (5015 JPEG files)</li> </ul> <h3>Additional data for evaluation purposes</h3> The HAM10000 dataset served as the training set for the <a href='http://arxiv.org/abs/1902.03368'>ISIC 2018 challenge (Task 3)</a>, with the same sources contributing the majority of the validation and test sets as well.
The test-set images are available herein as ISIC2018_Task3_Test_Images.zip (1511 images); the ground truth, in the same format as the HAM10000 data (public since 2023), is available as ISIC2018_Task3_Test_GroundTruth.csv. The ISIC-Archive also provides the challenge images and metadata (training, validation, test) at their <a href="https://challenge.isic-archive.com/data/#2018">"ISIC Challenge Datasets" page</a>. <h3>Comparison to physicians</h3> Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: <a href="https://doi.org/10.1016/S1470-2045(19)30333-X">Tschandl P. et al., Lancet Oncol 2019</a> <h3>Human-computer collaboration</h3> The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: <a href="https://www.nature.com/articles/s41591-020-0942-0">Tschandl P. et al., Nature Medicine 2020</a>. The following corresponding metadata is available herein: <ul> <li>ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv: Human ratings for test images with and without interaction with a ResNet34 CNN (Malignancy Probability, Multi-Class probability, CBIR) or Human-Crowd Multi-Class probabilities. This data was collected for and analyzed in <a href="https://doi.org/10.1038/s41591-020-0942-0">Tschandl P. et al., Nature Medicine 2020</a>, so please refer to this publication when using the data. Some details on the abbreviated column headings: <ul> <li>image_id: This is the ISIC image_id of an image at the time of the study. There should be no duplicates in the combination of image_id and interaction_modality. As not every image was shown with every interaction modality, not every combination is present.</li> <li>prob_m_dx_akiec, ...: *_m_* denotes "machine probabilities". Values are softmax outputs, and "_mal" is the sum over all malignant classes.</li> <li>prob_h_dx_akiec, ...: *_h_* denotes "human probabilities". Values are aggregated percentages of human ratings from past studies distinguishing between seven classes. Note there is no "prob_h_mal", as this was not one of the tested interaction modalities.</li> <li>user_dx_without_interaction_akiec, ...: Number of participants choosing this diagnosis without interaction.</li> <li>user_dx_with_interaction_akiec, ...: Number of participants choosing this diagnosis with interaction.</li> </ul> </li> <li> HAM10000_segmentations_lesion_tschandl.zip: To evaluate regions of CNN activations in <a href="https://www.nature.com/articles/s41591-020-0942-0">Tschandl P. et al., Nature Medicine 2020</a> (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. Masks were initialized with the segmentation network described by <a href="https://doi.org/10.1016/j.compbiomed.2018.11.010">Tschandl et al., Computers in Biology and Medicine 2019</a>, and subsequently verified, corrected or replaced via the free-hand selection tool in <a href="https://fiji.sc/">FIJI</a>.</li> </ul>
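
Two points in the description lend themselves to a small sanity-check script: lesions can have several images (tracked by `lesion_id`), and the interaction CSV should contain no repeated `image_id` / `interaction_modality` pair. The following is a minimal sketch in plain Python, not part of the dataset's own tooling. The helper names `split_by_lesion` and `duplicated_pairs`, the 20% validation fraction, and all IDs and modality labels in the demo rows are hypothetical; only the column names `lesion_id`, `image_id` and `interaction_modality` come from the description above.

```python
import random
from collections import Counter, defaultdict


def split_by_lesion(rows, val_fraction=0.2, seed=0):
    """Split metadata rows so that no lesion_id lands in both subsets.

    The fraction is applied to the number of lesions, not images, so the
    image-level proportions are only approximate.
    """
    by_lesion = defaultdict(list)
    for row in rows:
        by_lesion[row["lesion_id"]].append(row)

    # Sort first so the shuffle is reproducible for a given seed.
    lesion_ids = sorted(by_lesion)
    random.Random(seed).shuffle(lesion_ids)

    n_val = max(1, round(len(lesion_ids) * val_fraction))
    val = [r for lid in lesion_ids[:n_val] for r in by_lesion[lid]]
    train = [r for lid in lesion_ids[n_val:] for r in by_lesion[lid]]
    return train, val


def duplicated_pairs(rows):
    """Return (image_id, interaction_modality) pairs that occur more than once."""
    counts = Counter((r["image_id"], r["interaction_modality"]) for r in rows)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical demo rows; real files use their own ID formats and labels.
demo_metadata = [
    {"lesion_id": "lesion_A", "image_id": "img_1"},
    {"lesion_id": "lesion_A", "image_id": "img_2"},
    {"lesion_id": "lesion_B", "image_id": "img_3"},
    {"lesion_id": "lesion_C", "image_id": "img_4"},
]
train_rows, val_rows = split_by_lesion(demo_metadata, val_fraction=0.34, seed=0)
print(len(train_rows), "training rows,", len(val_rows), "validation rows")

demo_interaction = [
    {"image_id": "img_9", "interaction_modality": "modality_a"},
    {"image_id": "img_9", "interaction_modality": "modality_b"},
]
print("duplicated pairs:", duplicated_pairs(demo_interaction))
```

Rows in this shape can be obtained by reading each CSV with the standard library's `csv.DictReader`; the exact file paths depend on where the archives were extracted. Splitting at the lesion level rather than the image level avoids near-identical views of one lesion appearing on both sides of the split.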

Provider:

Harvard Dataverse

Created:

2018-06-04