
DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://doi.org/10.7910/DVN/QWQEMM
Description:
A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large image-text pair datasets. We introduce DermaSynth, a dataset comprising 92,020 synthetic image-text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Using a state-of-the-art LLM, Gemini 2.0, we applied clinically related prompts and the self-instruct method to generate diverse and rich synthetic texts. Metadata from the source datasets were incorporated into the input prompts to reduce potential hallucinations. The resulting dataset builds upon open access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We expect this dataset to support and accelerate AI research in dermatology. Data and code underlying this work are accessible at https://github.com/abdurrahimyilmaz/DermaSynth

## Citation

Please cite the paper if you use the data and code in your research:

@article{yilmaz2025dermasynth,
  title={DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets},
  author={Yilmaz, Abdurrahim and Yuceyalcin, Furkan and Gokyayla, Ece and Choi, Donghee and Erdem, Ozan and Demircali, Ali Anil and Varol, Rahmetullah and Kirabali, Ufuk Gorkem and Gencoglan, Gulsum and Posma, Joram M and Temelkuran, Burak},
  journal={arXiv preprint arXiv:2502.00196},
  year={2025}
}
Created:
2025-03-12