SciDraw-6K: A Multilingual Scientific Illustration Dataset Generated by Google Gemini
Source: DataCite Commons | Updated: 2026-04-20 | Indexed: 2026-05-04
Download link:
https://data.mendeley.com/datasets/m8z2jyr7z5/1
Description:
SciDraw-6K is a curated dataset of 6,291 scientific illustrations synthesized by Google Gemini image-generation models (primarily gemini-3-pro-image-preview and gemini-2.5-flash-image), each paired with aligned prompts in eleven languages: English, Simplified Chinese, Traditional Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Italian, and Russian.
Images span eight broad scientific categories — biomedical (44.9%), materials (13.4%), AI systems (11.2%), chemistry (9.7%), environment (9.2%), electronics (3.0%), physics (2.2%), and a residual "other" bucket (6.3%) covering long-tail disciplines such as robotics, mathematics, economics, civil engineering, and geosciences.
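As a quick sanity check on the breakdown above, the following sketch confirms that the published shares sum to roughly 100% and converts them into approximate per-category image counts. The counts are estimates derived from rounded percentages, not figures reported by the dataset authors, and the dictionary keys are shorthand chosen here for illustration.

```python
# Category shares as stated in the description (percent of 6,291 images).
TOTAL_IMAGES = 6291
CATEGORY_SHARES = {
    "biomedical": 44.9,
    "materials": 13.4,
    "ai_systems": 11.2,
    "chemistry": 9.7,
    "environment": 9.2,
    "electronics": 3.0,
    "physics": 2.2,
    "other": 6.3,
}


def total_share(shares: dict[str, float]) -> float:
    """Sum of all category percentages (expected to be close to 100)."""
    return sum(shares.values())


def estimated_counts(shares: dict[str, float], total: int) -> dict[str, int]:
    """Approximate image count per category, rounded to the nearest integer.

    Because the published percentages are themselves rounded, these values
    are estimates and need not sum exactly to the total.
    """
    return {name: round(total * pct / 100) for name, pct in shares.items()}


if __name__ == "__main__":
    print(f"shares sum to {total_share(CATEGORY_SHARES):.1f}%")
    for name, count in estimated_counts(CATEGORY_SHARES, TOTAL_IMAGES).items():
        print(f"{name:12s} ~{count}")
```

The small gap between the summed shares and 100% is consistent with each percentage having been rounded to one decimal place.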
Unlike general-purpose text-to-image corpora (LAION-5B, JourneyDB, DiffusionDB), which are dominated by photorealistic and artistic content, SciDraw-6K is purpose-built for the scientific-illustration genre: schematic diagrams, mechanism figures, table-of-contents graphical abstracts, and conceptual posters. The dataset is constructed via a domain-specific prompt taxonomy, Gemini image generation, LLM-based translation, and lightweight quality control.
Each row of the metadata contains: a stable image ID, the public image URL, the file extension, the category label, the eleven multilingual prompts, the Gemini model identifier, the generation type, the creation timestamp, and the SHA-256 hash of the downloaded image bytes.
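The row layout described above can be sketched as a small record type, together with a check that downloaded image bytes match the recorded SHA-256 hash. This is a minimal sketch: the field names and the `SciDrawRecord` class are hypothetical, since the description lists the fields but not the actual column names, so they should be checked against the metadata file header before use.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SciDrawRecord:
    """One metadata row. All field names are assumptions for illustration."""
    image_id: str            # stable image ID
    image_url: str           # public image URL
    file_ext: str            # file extension
    category: str            # category label
    prompts: dict[str, str]  # language -> prompt (eleven languages)
    model: str               # Gemini model identifier
    generation_type: str     # generation type
    created_at: str          # creation timestamp, kept as the raw string
    sha256: str              # hex SHA-256 of the downloaded image bytes


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_hash(record: SciDrawRecord, image_bytes: bytes) -> bool:
    """True if the image bytes hash to the value stored in the record."""
    return sha256_hex(image_bytes) == record.sha256.strip().lower()
```

In use, the bytes would come from fetching `image_url` or reading the local copy of the image; a mismatch suggests a truncated or altered download.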
Intended uses include: multilingual text-to-image research, domain-adapted diffusion fine-tuning, prompt-engineering studies for scientific visualization, retrieval-augmented generation for scientific figure synthesis, and benchmarking frontier image-generation models on the specialized visual grammar of science.
This Mendeley Data record archives the dataset metadata. The full image payload (~19 GB) is hosted on Hugging Face (https://huggingface.co/datasets/SciDrawAI/SciDraw-6K) and archived on Zenodo (https://doi.org/10.5281/zenodo.19642870) and Harvard Dataverse (https://doi.org/10.7910/DVN/L02REW). Construction scripts are available at https://github.com/SciDrawAI/scidraw-6k. Service powered by this dataset: https://sci-draw.com.
Provider:
Mendeley Data
Created:
2026-04-20



