SMART-OM: A SMARTphone based expert annotated dataset of Oral Mucosa images
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/SMART-OM_A_SMARTphone_based_expert_annotated_dataset_of_Oral_Mucosa_images/31341790
Resource description:
Abstract
In this study, we introduce a dataset of oral lesion images to advance the development of automated lesion classification models using machine learning (ML) and artificial intelligence (AI) techniques for early detection and diagnosis of oral cancer (OC) and oral potentially malignant disorders (OPMD). The dataset comprises 2469 RGB images collected from 331 subjects, systematically categorized into four distinct classes: healthy/normal, variations from normal, OPMD, and OC. These images of the oral cavity are captured using standard Android and iOS smartphone cameras under real-world clinical conditions in the presence of visible light. Each image of the dataset is annotated by expert dental surgeons using polygonal contours to demarcate the oral cavity and lesion boundaries. The annotation is performed in a semi-automated manner using the VGG image annotator tool. Clinical data is recorded using a customized Jotform, and detailed patient metadata, including age, sex, clinical diagnosis, and lifestyle-based risk indicators such as smoking and smokeless tobacco usage, alcohol consumption, and areca nut chewing, is compiled. All data collection and handling procedures adhered to ethical guidelines outlined in the Declaration of Helsinki and its amendments for research involving human subjects, with informed consent obtained from each subject. The final dataset, including annotated images and metadata, is stored in a standardized JSON format. This curated dataset is intended to support researchers in developing cutting-edge ML/AI-based solutions for improved oral lesion diagnosis and cancer screening.
Usage Notes
Oral Tissue Class
The dataset is organized into four directories. Each directory corresponds to a specific class label of oral tissue pathology and contains associated image data as well as detailed annotations. The description of class labels is as follows:
Normal – representing healthy oral tissues
Variation from Normal – comprising tissues that exhibit deviations from typical morphology but are not classified as pathological
OPMD – encompassing Oral Potentially Malignant Disorders
Oral Cancer – comprising clinically and histologically diagnosed malignant oral lesions
Oral Tissue Annotation Level
Each class directory is further divided into four sub-directories based on the level of annotation available:
Unannotated – contains raw images without any annotation.
Region Annotation – includes images with annotations highlighting specific sites of interest.
Full Annotation – comprises images with comprehensive annotations covering all intra-oral soft and hard tissues.
Lesion Annotation – focuses on lesions annotated within the images.
Note: The Normal directory does not include the Lesion Annotation subdirectory, as normal tissues do not contain lesions.
Within all subdirectories, including Unannotated, images are categorized based on eight anatomical sites:
Dorsal tongue
Ventral tongue
Left buccal mucosa
Right buccal mucosa
Upper lip
Lower lip
Upper arch
Lower arch
Note: Certain classes might be missing one or more anatomical region directories, depending on image availability.
Annotation Format
The images in the dataset are saved in JPEG format, with filenames structured as A_B_C.JPG.
Images are annotated using an open-source annotation tool, VGG Image Annotator (VIA). Annotations for the images are stored in JSON (JavaScript Object Notation) files, following the naming convention A_B_D.json. These JSON files contain structured information outlining labeled regions, typically in the form of polygon-shaped annotations.
Here, "A" denotes a unique anonymized alphanumeric patient ID, "B" indicates the location where the images were captured (R for Ranipet and W for the World Vision camp), "C" specifies one of the eight intraoral sites using a two-letter code (DT for dorsal tongue, VT for ventral tongue, LB for left buccal mucosa, RB for right buccal mucosa, UL for upper lip, LL for lower lip, UA for upper dental arch, and LA for lower dental arch), and "D" indicates the level of annotation.
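The image naming convention can be decoded programmatically. The following is a minimal sketch, not part of the dataset's tooling: the names `ImageName` and `parse_image_name` are hypothetical, the example filename is invented, and the code assumes the three fields are separated by underscores exactly as in A_B_C.JPG.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Two-letter site codes and location codes as listed in the description.
SITE_CODES = {
    "DT": "dorsal tongue",
    "VT": "ventral tongue",
    "LB": "left buccal mucosa",
    "RB": "right buccal mucosa",
    "UL": "upper lip",
    "LL": "lower lip",
    "UA": "upper dental arch",
    "LA": "lower dental arch",
}
LOCATION_CODES = {"R": "Ranipet", "W": "World Vision camp"}


@dataclass(frozen=True)
class ImageName:
    patient_id: str
    location: str
    site: str


def parse_image_name(filename: str) -> ImageName:
    """Split an A_B_C.JPG image filename into its three documented fields."""
    stem = PurePath(filename).stem
    # Splitting from the right keeps any underscore inside the patient ID
    # intact (an assumption; the source only says the ID is alphanumeric).
    patient_id, location, site = stem.rsplit("_", 2)
    if location not in LOCATION_CODES:
        raise ValueError(f"unknown location code: {location!r}")
    if site not in SITE_CODES:
        raise ValueError(f"unknown site code: {site!r}")
    return ImageName(patient_id, LOCATION_CODES[location], SITE_CODES[site])


# Hypothetical filename, for illustration only.
print(parse_image_name("P001_R_DT.JPG"))
```

The values that "D" takes in the JSON filenames are not listed in the description, so this sketch covers image filenames only.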
Key features of the annotation files:
Each JSON file encodes spatial coordinates that define the annotated regions.
Annotations can be visualized as overlays on the corresponding images using the VGG image annotator.
The JSON files are available as a separate download, enabling flexible dataset usage.
Note: The Unannotated subdirectory does not contain JSON files, as it consists only of raw images.
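The polygon coordinates can also be read without the VIA interface. The sketch below assumes the files follow the VGG Image Annotator 2.x JSON layout, in which each image entry holds a `regions` list and each polygon stores `all_points_x` and `all_points_y` under `shape_attributes`; the dataset description does not confirm the exact layout, so this should be checked against a real file. The function name `iter_polygons` and the sample entry are hypothetical.

```python
import json


def iter_polygons(via_data: dict):
    """Yield (filename, region_number, [(x, y), ...]) for each polygon region."""
    # A VIA 2.x project file nests per-image entries under "_via_img_metadata";
    # a plain annotation export is that mapping itself.
    entries = via_data.get("_via_img_metadata", via_data)
    for entry in entries.values():
        if not isinstance(entry, dict) or "filename" not in entry:
            continue
        regions = entry.get("regions", [])
        # Older VIA exports store regions as a dict keyed by index.
        if isinstance(regions, dict):
            regions = list(regions.values())
        # Numbering from 1 in region order is an assumption about how the
        # polygon numbers in the descriptor files were assigned.
        for number, region in enumerate(regions, start=1):
            shape = region.get("shape_attributes", {})
            if shape.get("name") != "polygon":
                continue
            points = list(zip(shape["all_points_x"], shape["all_points_y"]))
            yield entry["filename"], number, points


# Invented sample in the assumed layout, for illustration only.
sample = json.loads("""
{
  "P001_R_DT.JPG12345": {
    "filename": "P001_R_DT.JPG",
    "size": 12345,
    "regions": [
      {
        "shape_attributes": {
          "name": "polygon",
          "all_points_x": [10, 50, 30],
          "all_points_y": [10, 10, 40]
        },
        "region_attributes": {}
      }
    ],
    "file_attributes": {}
  }
}
""")

polygons = list(iter_polygons(sample))
for filename, number, points in polygons:
    print(filename, number, points)
```

For a real file, the `sample` dictionary would be replaced by the result of `json.load` on an opened annotation file.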
Descriptors Format
The Descriptors directory contains XLSX files corresponding to each annotation level.
Each file includes separate Excel sheets for different tissue classes.
Images are annotated using the VGG Image Annotator: relevant regions are outlined with polygons, and each polygon is associated with a numeric identifier. The labels of these regions can be mapped using the appropriate Descriptor XLSX files.
Within each sheet, the columns represent the following:
S.No: Serial number of the entry
File_name: Name of the annotated image file
Label_ID: Expert labels that map to the corresponding polygonal annotation number in the image (1, 2, 3, ...).
The descriptors, along with the annotations, provide features that can be used to develop machine learning models for early diagnosis in oral pathology.
Demographic, Habit History & Clinical Metadata
In the Metadata directory, an Excel file containing demographic and clinical data is provided. It comprises two sheets, 'Demographics & personal history' and 'Clinical findings'. Each row provides the data of one patient, with SMITA_ID as the key. The metadata is provided alongside the image annotations to facilitate multi-modal learning.
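Because both sheets share the SMITA_ID key, they can be combined into one record per patient. The sketch below works on rows already loaded as dictionaries (for example, by a spreadsheet reader of your choice); `merge_patient_sheets` is a hypothetical helper, and every column name other than SMITA_ID, as well as the sample values, is invented for illustration.

```python
def merge_patient_sheets(demographics, clinical, key="SMITA_ID"):
    """Combine rows from the two metadata sheets into one record per patient."""
    merged = {}
    for row in demographics:
        merged[row[key]] = dict(row)
    for row in clinical:
        # setdefault keeps patients that appear only in the clinical sheet.
        merged.setdefault(row[key], {key: row[key]}).update(row)
    return merged


# Invented rows; real column names must be taken from the Excel file.
demographics_rows = [
    {"SMITA_ID": "P001", "Age": 54, "Sex": "M"},
    {"SMITA_ID": "P002", "Age": 37, "Sex": "F"},
]
clinical_rows = [
    {"SMITA_ID": "P001", "Clinical diagnosis": "OPMD"},
]

patients = merge_patient_sheets(demographics_rows, clinical_rows)
print(patients["P001"])
```

How SMITA_ID relates to the patient ID embedded in the image filenames is not stated in the description, so linking these records to images should be verified against the data.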
Directory Structure
The detailed directory structure is presented below.
SMART-OM/
├── 01. Normal/
│ ├── 01. Unannotated/
│ │ ├── 01. Dorsal tongue/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 05. Upper lip/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ └── 08. Lower arch/
│ ├── 02. Region annotation/
│ │ ├── 01. Dorsal tongue/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 05. Upper lip/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ ├── 08. Lower arch/
│ │ └── 09. Json files/
│ └── 03. Full annotation/
│   ├── 01. Dorsal tongue/
│   ├── 02. Ventral tongue/
│   ├── 03. Left buccal mucosa/
│   ├── 04. Right buccal mucosa/
│   ├── 05. Upper lip/
│   ├── 06. Lower lip/
│   ├── 07. Upper arch/
│   ├── 08. Lower arch/
│   └── 09. Json files/
├── 02. Variation from normal/
│ ├── 01. Unannotated/
│ │ ├── 01. Dorsal tongue/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 06. Lower lip/
│ │ └── 07. Upper arch/
│ ├── 02. Region annotation/
│ │ ├── 01. Dorsal tongue/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ └── region json/
│ ├── 03. Full annotation/
│ │ ├── 01. Dorsal tongue/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ └── full json/
│ └── 04. Lesion annotation/
│   ├── 01. Dorsal tongue/
│   ├── 02. Ventral tongue/
│   ├── 03. Left buccal mucosa/
│   ├── 04. Right buccal mucosa/
│   ├── 06. Lower lip/
│   ├── 07. Upper arch/
│   └── lesion json/
├── 03. OPMD/
│ ├── 01. Unannotated/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 05. Upper lip/
│ │ ├── 06. Lower lip/
│ │ └── 07. Upper arch/
│ ├── 02. Region annotation/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 05. Upper lip/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ └── region json/
│ ├── 03. Full annotation/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 05. Upper lip/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ └── full json/
│ └── 04. Lesion annotation/
│   ├── 02. Ventral tongue/
│   ├── 03. Left buccal mucosa/
│   ├── 04. Right buccal mucosa/
│   ├── 05. Upper lip/
│   ├── 06. Lower lip/
│   ├── 07. Upper arch/
│   └── lesion json/
├── 04. Oral Cancer/
│ ├── 01. Unannotated/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ └── 08. Lower arch/
│ ├── 02. Region annotation/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ ├── 08. Lower arch/
│ │ └── Region Json/
│ ├── 03. Full annotation/
│ │ ├── 02. Ventral tongue/
│ │ ├── 03. Left buccal mucosa/
│ │ ├── 04. Right buccal mucosa/
│ │ ├── 06. Lower lip/
│ │ ├── 07. Upper arch/
│ │ ├── 08. Lower arch/
│ │ └── Full Json/
│ └── 04. Lesion annotation/
│   ├── 02. Ventral tongue/
│   ├── 03. Left buccal mucosa/
│   ├── 04. Right buccal mucosa/
│   ├── 06. Lower lip/
│   ├── 07. Upper arch/
│   ├── 08. Lower arch/
│   └── Lesion Json/
├── Descriptors/
│ ├── 01. Descriptors_for_region_annotation.xlsx
│ ├── 02. Descriptors_for_full_annotation.xlsx
│ └── 03. Descriptors_for_lesion_annotation.xlsx
└── Metadata/
  └── Patient's Metadata.xlsx
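The class, annotation level, and anatomical site of each image can be recovered from its position in this tree. The following is a minimal sketch under the assumption that images sit exactly three directories below the dataset root (class / annotation level / site) and carry a .JPG or .JPEG extension; `index_images` is a hypothetical helper, not part of the dataset.

```python
import re
from pathlib import Path

# Strips numeric prefixes such as "01. " from directory names.
_PREFIX = re.compile(r"^\d+\.\s*")
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def index_images(root):
    """Return one record per image with its class, annotation level, and site."""
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        parts = path.relative_to(root).parts
        # Expect class / annotation level / site / file; anything else
        # (Descriptors, Metadata, JSON folders) is skipped.
        if len(parts) != 4:
            continue
        tissue_class, level, site = (_PREFIX.sub("", part) for part in parts[:3])
        records.append(
            {"class": tissue_class, "level": level, "site": site, "path": path}
        )
    return records


# Usage, assuming the archive was extracted to a local "SMART-OM" folder:
# records = index_images("SMART-OM")
```

Since some classes lack certain site directories, code that consumes these records should not assume all eight sites are present for every class.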
Human subjects data
We confirm that all data collection and handling procedures adhered to ethical guidelines outlined in the Declaration of Helsinki and its amendments for research involving human subjects, with informed consent obtained from each subject.
To de-identify the data in the public domain, we have used unique alpha-numeric patient IDs for each subject.
Created:
2026-02-18



