SciDrawAI/SciDraw-6K
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/SciDrawAI/SciDraw-6K
Resource overview:
---
license: cc-by-4.0
task_categories:
- text-to-image
language:
- en
- zh
- ja
- ko
- de
- fr
- es
- pt
- it
- ru
tags:
- scientific-illustration
- gemini
- multilingual
- synthetic
- text-to-image
- scientific-visualization
pretty_name: "SciDraw-6K: Multilingual Scientific Illustration Dataset"
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: metadata.parquet
doi: 10.5281/zenodo.19642870
---
<!-- schema note: the `image` column holds a relative path to the file under `images/`; HF will render it as a thumbnail. -->
# SciDraw-6K: A Multilingual Scientific Illustration Dataset Generated by Google Gemini
## Dataset Summary
SciDraw-6K is a curated dataset of **6,291 scientific illustrations** synthesized by Google Gemini image-generation models, each paired with prompts in **11 languages** (English, Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Traditional Chinese, Italian, and Russian).
Images span **8 broad scientific categories**: biomedical, chemistry, materials, electronics, environment, AI systems, physics, and a residual "other" bucket covering long-tail disciplines.
## Dataset Structure
```
├── README.md
├── metadata.jsonl # Full metadata, one JSON object per line
├── metadata.parquet # Same data in Parquet format (if available)
├── metadata.validation.json # Export-time quality checks
├── splits.json # Train/val/test splits (prompt-grouped)
└── images/
    ├── biomedical/ # 2,827 images (~8.5 GB)
    ├── materials/ # 841 images (~2.7 GB)
    ├── ai_system/ # 705 images (~2.0 GB)
    ├── chemistry/ # 609 images (~1.8 GB)
    ├── environment/ # 581 images (~1.8 GB)
    ├── other/ # 396+ images (~1.2 GB)
    ├── electronics/ # 190 images (~569 MB)
    └── physics/ # 139 images (~378 MB)
```
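The layout above describes `splits.json` as prompt-grouped. The file's exact format and the procedure used to build it are not documented here, so the following is only a minimal sketch of what a prompt-grouped split can look like: every image generated from the same original prompt lands in the same split. The function names and the 80/10/10 ratio are assumptions for illustration.

```python
import hashlib


def assign_split(prompt_text, val_frac=0.1, test_frac=0.1):
    """Map a prompt to a split deterministically (hypothetical helper).

    Identical prompt text always hashes to the same value, so all images
    that share a prompt end up in the same split.
    """
    digest = hashlib.sha256(prompt_text.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def split_rows(rows):
    """Group row ids by split, keyed on each row's original prompt."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        splits[assign_split(row["prompts"]["original"])].append(row["id"])
    return splits
```

Grouping by prompt rather than by image avoids leaking near-duplicate images of one prompt across train and test. For real experiments, prefer the shipped `splits.json` over a recomputed split.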
## Metadata Schema
Each row in `metadata.jsonl` contains:
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique image identifier |
| `image` | string | Relative path to local image file (e.g. `images/biomedical/gal_xxx.png`) |
| `image_ext` | string | File extension (usually `png`) |
| `raw_category` | string | Original fine-grained category label |
| `release_category` | string | Normalized 8-class category |
| `category` | string | Same as `release_category` |
| `prompts` | object | Prompt object holding the original prompt plus its 11 language versions (keys: `original`, `en`, `zh`, `ja`, `ko`, `de`, `fr`, `es`, `pt_br`, `zh_tw`, `it`, `ru`) |
| `gemini_model` | string\|null | Gemini model identifier (null for ~7% of rows) |
| `generation_type` | string\|null | Generation type (e.g., `text_to_image`) |
| `created_at` | string | ISO 8601 timestamp |
| `image_sha256` | string | SHA-256 hash of image bytes |
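
The `image` and `image_sha256` fields together allow an integrity check of downloaded files. The sketch below assumes the hash is stored as a hex string and that `image` is relative to the dataset root; `sha256_of_file` and `verify_row` are hypothetical helper names, not part of the dataset's tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_row(row, root="."):
    """Check one metadata row against the image file on disk.

    Assumes `row["image"]` is relative to `root` and that
    `row["image_sha256"]` is a hex digest (case-insensitive).
    """
    actual = sha256_of_file(Path(root) / row["image"])
    return actual == row["image_sha256"].lower()
```

Calling `verify_row(row, root="path/to/dataset")` for each row flags truncated or corrupted downloads before training.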
## Category Distribution
| Category | Count | Percentage |
|---|---|---|
| biomedical | 2,827 | 44.9% |
| materials | 841 | 13.4% |
| ai_system | 705 | 11.2% |
| chemistry | 609 | 9.7% |
| environment | 581 | 9.2% |
| other | 396 | 6.3% |
| electronics | 190 | 3.0% |
| physics | 139 | 2.2% |
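
A table like the one above can be recomputed from the metadata rows. This is a minimal sketch, assuming `rows` is a list of dicts loaded from `metadata.jsonl`; `category_distribution` is a hypothetical helper name.

```python
from collections import Counter


def category_distribution(rows):
    """Return (category, count, percentage) tuples, largest category first."""
    counts = Counter(r["release_category"] for r in rows)
    total = sum(counts.values())
    return [
        (cat, n, round(100 * n / total, 1))
        for cat, n in counts.most_common()
    ]
```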
## Source Models
| Model | Count |
|---|---|
| gemini-3-pro-image-preview | 4,624 |
| gemini-2.5-flash-image | 1,109 |
| gemini-3.1-flash-image-preview | 130 |
| unknown (null) | 428 |
## Multilingual Coverage
All 11 language prompt fields are populated for **100%** of released images.
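
The coverage claim can be checked locally. The sketch below uses the language keys listed in the metadata schema and treats a missing, null, or whitespace-only value as unpopulated, which is an assumption about what "populated" means; the helper names are hypothetical.

```python
# The 11 language keys from the metadata schema (excluding `original`).
LANG_KEYS = ("en", "zh", "ja", "ko", "de", "fr", "es", "pt_br", "zh_tw", "it", "ru")


def missing_languages(row):
    """List the language keys that are absent or empty in one row."""
    prompts = row.get("prompts") or {}
    return [k for k in LANG_KEYS if not str(prompts.get(k) or "").strip()]


def coverage(rows):
    """Fraction of rows in which all 11 language prompts are populated."""
    if not rows:
        return 0.0
    complete = sum(1 for r in rows if not missing_languages(r))
    return complete / len(rows)
```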
## Usage
```python
from datasets import load_dataset
ds = load_dataset("SciDrawAI/SciDraw-6K")
```
Or load the JSONL directly:
```python
import json

# Read the metadata file line by line; each line is one JSON object.
rows = []
with open("metadata.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # skip blank lines
            rows.append(json.loads(line))

print(f"Total images: {len(rows)}")
print(f"Categories: {set(r['release_category'] for r in rows)}")
```
## Intended Uses
- **Multilingual T2I research**: 11 aligned language prompts per image
- **Domain-adapted diffusion fine-tuning**: Scientific illustration style transfer
- **Prompt engineering studies**: Template-driven scientific visualization prompts
- **Retrieval-augmented generation**: Few-shot exemplar retrieval for scientific figures
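
For the multilingual T2I and fine-tuning uses above, a common first step is to turn metadata rows into (prompt, image path) pairs for one language. The following is a minimal sketch, assuming `rows` was loaded from `metadata.jsonl`; `prompt_image_pairs` is a hypothetical helper, and rows lacking the requested language are skipped rather than backfilled.

```python
def prompt_image_pairs(rows, lang="en", category=None):
    """Yield (prompt_text, image_path) pairs for one language.

    `lang` is one of the keys of the `prompts` object (e.g. "ja", "pt_br").
    If `category` is given, only rows with that `release_category` are kept.
    """
    for row in rows:
        if category is not None and row["release_category"] != category:
            continue
        text = row["prompts"].get(lang)
        if text:
            yield text, row["image"]
```

For example, `list(prompt_image_pairs(rows, lang="de", category="chemistry"))` would collect the German prompts for the chemistry images.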
## Limitations
- **Single-source bias**: All images from Google Gemini; stylistic biases are baked in
- **Category imbalance**: Biomedical dominates (~45%); some fine-grained disciplines have fewer than 10 images
- **English-anchored translations**: Non-English prompts are LLM translations, not native captions
- **Incomplete provenance**: ~7% of rows lack model/generation-type metadata
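
Analyses that depend on the source model may want to set aside the rows with incomplete provenance. This is a minimal sketch, assuming missing values appear as `null` (Python `None`) in `gemini_model` or `generation_type`, as the schema indicates; the helper names are hypothetical.

```python
def has_full_provenance(row):
    """True if both the model identifier and the generation type are present."""
    return (
        row.get("gemini_model") is not None
        and row.get("generation_type") is not None
    )


def partition_by_provenance(rows):
    """Split rows into (complete, incomplete) lists by provenance metadata."""
    complete, incomplete = [], []
    for row in rows:
        (complete if has_full_provenance(row) else incomplete).append(row)
    return complete, incomplete
```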
## Citation
```bibtex
@dataset{chen_scidraw6k_2026,
author = {Chen, Davie},
title = {SciDraw-6K: A Multilingual Scientific Illustration
Dataset Generated by Google Gemini},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19642870},
url = {https://doi.org/10.5281/zenodo.19642870}
}
```
## License
This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
## Related Resources
- **Service**: [sci-draw.com](https://sci-draw.com) — public scientific drawing platform powered by this dataset
- **Code**: [github.com/SciDrawAI/scidraw-6k](https://github.com/SciDrawAI/scidraw-6k) — loading scripts, reproducible stats, retrieval demo
- **DOI (Zenodo)**: [10.5281/zenodo.19642870](https://doi.org/10.5281/zenodo.19642870)
- **Contact**: `contact@sci-draw.com`
Provider:
SciDrawAI



