
SciDrawAI/SciDraw-6K

Hugging Face | Updated 2026-04-18 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/SciDrawAI/SciDraw-6K
Resource overview:
---
license: cc-by-4.0
task_categories:
- text-to-image
language:
- en
- zh
- ja
- ko
- de
- fr
- es
- pt
- it
- ru
tags:
- scientific-illustration
- gemini
- multilingual
- synthetic
- text-to-image
- scientific-visualization
pretty_name: "SciDraw-6K: Multilingual Scientific Illustration Dataset"
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: metadata.parquet
doi: 10.5281/zenodo.19642870
---

<!-- schema note: the `image` column holds a relative path to the file under `images/`; HF will render it as a thumbnail. -->

# SciDraw-6K: A Multilingual Scientific Illustration Dataset Generated by Google Gemini

## Dataset Summary

SciDraw-6K is a curated dataset of **6,291 scientific illustrations** synthesized by Google Gemini image-generation models, each paired with prompts in **11 languages** (English, Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Traditional Chinese, Italian, and Russian). Images span **8 broad scientific categories**: biomedical, chemistry, materials, electronics, environment, AI systems, physics, and a residual "other" bucket covering long-tail disciplines.

## Dataset Structure

```
├── README.md
├── metadata.jsonl               # Full metadata, one JSON object per line
├── metadata.parquet             # Same data in Parquet format (if available)
├── metadata.validation.json     # Export-time quality checks
├── splits.json                  # Train/val/test splits (prompt-grouped)
└── images/
    ├── biomedical/              # 2,827 images (~8.5 GB)
    ├── materials/               # 841 images (~2.7 GB)
    ├── ai_system/               # 705 images (~2.0 GB)
    ├── chemistry/               # 609 images (~1.8 GB)
    ├── environment/             # 581 images (~1.8 GB)
    ├── other/                   # 396+ images (~1.2 GB)
    ├── electronics/             # 190 images (~569 MB)
    └── physics/                 # 139 images (~378 MB)
```

## Metadata Schema

Each row in `metadata.jsonl` contains:

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique image identifier |
| `image` | string | Relative path to local image file (e.g. `images/biomedical/gal_xxx.png`) |
| `image_ext` | string | File extension (usually `png`) |
| `raw_category` | string | Original fine-grained category label |
| `release_category` | string | Normalized 8-class category |
| `category` | string | Same as `release_category` |
| `prompts` | object | 11-language prompt object (keys: `original`, `en`, `zh`, `ja`, `ko`, `de`, `fr`, `es`, `pt_br`, `zh_tw`, `it`, `ru`) |
| `gemini_model` | string\|null | Gemini model identifier (null for ~7% of rows) |
| `generation_type` | string\|null | Generation type (e.g., `text_to_image`) |
| `created_at` | string | ISO 8601 timestamp |
| `image_sha256` | string | SHA-256 hash of image bytes |

## Category Distribution

| Category | Count | Percentage |
|---|---|---|
| biomedical | 2,827 | 44.9% |
| materials | 841 | 13.4% |
| ai_system | 705 | 11.2% |
| chemistry | 609 | 9.7% |
| environment | 581 | 9.2% |
| other | 396 | 6.3% |
| electronics | 190 | 3.0% |
| physics | 139 | 2.2% |

## Source Models

| Model | Count |
|---|---|
| gemini-3-pro-image-preview | 4,624 |
| gemini-2.5-flash-image | 4,601 |
| gemini-3.1-flash-image-preview | 130 |
| unknown (null) | 428 |

## Multilingual Coverage

All 11 language prompt fields are populated for **100%** of released images.
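The category percentages and the prompt-coverage claim above can be recomputed from the metadata rows. The following is a minimal sketch, assuming each row is a dict shaped like the schema table; the helper names `category_percentages` and `rows_missing_prompts` and the three sample rows are hypothetical and only illustrate the calculation.

```python
from collections import Counter

# The 11 language keys listed in the schema table (`original` is left out here
# because the coverage claim concerns the language fields).
LANGUAGE_KEYS = ("en", "zh", "ja", "ko", "de", "fr", "es",
                 "pt_br", "zh_tw", "it", "ru")


def category_percentages(rows):
    """Return {category: (count, percent)} based on `release_category`."""
    counts = Counter(row["release_category"] for row in rows)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.most_common()}


def rows_missing_prompts(rows):
    """Return ids of rows where any language prompt is absent or empty."""
    return [
        row["id"]
        for row in rows
        if any(not (row.get("prompts") or {}).get(key) for key in LANGUAGE_KEYS)
    ]


# Hypothetical rows for illustration only; real rows come from metadata.jsonl.
full_prompts = {key: "text" for key in LANGUAGE_KEYS}
sample_rows = [
    {"id": "a", "release_category": "biomedical", "prompts": full_prompts},
    {"id": "b", "release_category": "biomedical", "prompts": full_prompts},
    {"id": "c", "release_category": "physics", "prompts": {"en": "text"}},
]

stats = category_percentages(sample_rows)
missing = rows_missing_prompts(sample_rows)
print(stats)
print(missing)
```

Run against the full `metadata.jsonl`, the same two helpers give a quick way to compare the published table with the files actually downloaded.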
## Usage

```python
from datasets import load_dataset

ds = load_dataset("SciDrawAI/SciDraw-6K")
```

Or load the JSONL directly:

```python
import json

# Read one JSON object per line; the prompts are multilingual, so decode as UTF-8.
rows = []
with open("metadata.jsonl", encoding="utf-8") as f:
    for line in f:
        rows.append(json.loads(line))

print(f"Total images: {len(rows)}")
print(f"Categories: {set(r['release_category'] for r in rows)}")
```

## Intended Uses

- **Multilingual T2I research**: 11 aligned language prompts per image
- **Domain-adapted diffusion fine-tuning**: Scientific illustration style transfer
- **Prompt engineering studies**: Template-driven scientific visualization prompts
- **Retrieval-augmented generation**: Few-shot exemplar retrieval for scientific figures

## Limitations

- **Single-source bias**: All images come from Google Gemini; stylistic biases are baked in
- **Category imbalance**: Biomedical dominates (~45%); some disciplines have < 10 images
- **English-anchored translations**: Non-English prompts are LLM translations, not native captions
- **Incomplete provenance**: ~7% of rows lack model/generation-type metadata

## Citation

```bibtex
@dataset{chen_scidraw6k_2026,
  author    = {Chen, Davie},
  title     = {SciDraw-6K: A Multilingual Scientific Illustration Dataset Generated by Google Gemini},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19642870},
  url       = {https://doi.org/10.5281/zenodo.19642870}
}
```

## License

This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

## Related Resources

- **Service**: [sci-draw.com](https://sci-draw.com), a public scientific drawing platform powered by this dataset
- **Code**: [github.com/SciDrawAI/scidraw-6k](https://github.com/SciDrawAI/scidraw-6k), with loading scripts, reproducible stats, and a retrieval demo
- **DOI (Zenodo)**: [10.5281/zenodo.19642870](https://doi.org/10.5281/zenodo.19642870)
- **Contact**: `contact@sci-draw.com`
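Because every row carries an `image_sha256` field, a downloaded copy can be checked for missing or corrupted images after loading the JSONL as shown in Usage. This is a minimal sketch, assuming the hash is stored as a hexadecimal digest and that `image` is relative to the dataset root; the helper names and the demo files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_rows(rows, dataset_root):
    """Return ids of rows whose image is missing or does not match `image_sha256`."""
    root = Path(dataset_root)
    bad = []
    for row in rows:
        image_path = root / row["image"]
        expected = row["image_sha256"].lower()  # assumption: hex digest, any case
        if not image_path.is_file() or sha256_of_file(image_path) != expected:
            bad.append(row["id"])
    return bad


# Demo with throwaway files; real rows and images come from the dataset itself.
with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp) / "images" / "physics"
    image_dir.mkdir(parents=True)
    payload = b"demo bytes, not a real PNG"
    (image_dir / "demo.png").write_bytes(payload)
    demo_rows = [
        {"id": "ok", "image": "images/physics/demo.png",
         "image_sha256": hashlib.sha256(payload).hexdigest()},
        {"id": "bad", "image": "images/physics/demo.png",
         "image_sha256": "0" * 64},
        {"id": "missing", "image": "images/physics/absent.png",
         "image_sha256": "0" * 64},
    ]
    corrupt = find_corrupt_rows(demo_rows, tmp)

print(corrupt)
```

On a real download, `find_corrupt_rows(rows, ".")` would be called from the directory that contains `images/`, with `rows` taken from `metadata.jsonl`.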
Provider:
SciDrawAI