EvitFan/SmolTextRender

Name: EvitFan/SmolTextRender
Creator: EvitFan
Published: 2026-03-27 19:45:57
License: MIT

Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/EvitFan/SmolTextRender


Description:

---
license: mit
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: image_id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: caption
    dtype: string
  - name: prompt
    dtype: string
  - name: split
    dtype: string
  - name: ocr_confidence
    dtype: float64
  - name: ocr_backend
    dtype: string
  - name: caption_model
    dtype: string
  - name: source
    dtype: string
  - name: sharpness
    dtype: float64
  - name: brightness
    dtype: float64
  - name: contrast
    dtype: float64
  - name: resolution_w
    dtype: int64
  - name: resolution_h
    dtype: int64
  - name: text_length
    dtype: int64
  - name: word_count
    dtype: int64
  - name: phrase_reconstructed
    dtype: bool
  splits:
  - name: train
    num_bytes: 58573006
    num_examples: 800
  - name: val
    num_bytes: 6821157
    num_examples: 100
  - name: test
    num_bytes: 6848431
    num_examples: 100
  download_size: 72132017
  dataset_size: 72242594
task_categories:
- image-to-text
- text-to-image
language:
- en
tags:
- ocr
- image-captioning
- text-rendering
- synthetic
- blip2
- easyocr
- flux
size_categories:
- 1K<n<10K
source_datasets:
- stzhao/AnyWord-3M
---

# Text-in-Image OCR Dataset

*Built for **Project 12: Efficient Image Generation**, as part of the ENSTA course [CSC_5IA21](https://giannifranchi.github.io/CSC_5IA21.html)*

**Team:** Adam Gassem · Asma Walha · Achraf Chaouch · Takoua Ben Aissa · Amaury Lorin

**Tutors:** Arturo Mendoza Quispe · Nacim Belkhir

---

## Dataset Summary

A curated text-in-image dataset designed for fine-tuning text-to-image generative models (e.g. FLUX, Stable Diffusion, ControlNet) on accurate **text rendering**. Each sample pairs a real-world image containing readable text with:

- a verified OCR transcription (EasyOCR),
- a visual caption (BLIP-2),
- and a training prompt that embeds the OCR text verbatim.

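
As a rough illustration of how the three annotations fit together, the sketch below assembles a prompt from a caption and an OCR string, then checks that the text appears verbatim. The helpers `build_prompt` and `embeds_text_verbatim` are hypothetical, and the template is only inferred from the sample record in this card; the project's actual prompt-construction code may differ.

```python
def build_prompt(caption: str, text: str) -> str:
    """Hypothetical helper: append the OCR text, quoted verbatim, to the caption."""
    # Drop surrounding whitespace and the final period so the clause attaches cleanly.
    base = caption.strip().rstrip(".")
    # Assumed template, inferred from the card's sample record.
    return f'{base}, with the text "{text}" clearly visible'


def embeds_text_verbatim(prompt: str, text: str) -> bool:
    """Check that the prompt contains the OCR text exactly, inside double quotes."""
    return f'"{text}"' in prompt


prompt = build_prompt("A storefront with a neon sign above the door.", "OPEN")
print(prompt)
print(embeds_text_verbatim(prompt, "OPEN"))
```

A check like `embeds_text_verbatim` could serve as a sanity test over the `prompt` and `text` fields before fine-tuning; it is case-sensitive, so a prompt containing only "open" would not pass for the text "OPEN".
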
Images are sourced from [AnyWord-3M](https://huggingface.co/datasets/stzhao/AnyWord-3M) and pass a rigorous multi-step quality pipeline before inclusion.

---

## Dataset Structure

| Split | Size |
|-------|------|
| train | 800 samples |
| val | 100 samples |
| test | 100 samples |

### Fields

| Field | Type | Description |
|-------|------|-------------|
| `image` | Image | The filtered image (512 px, JPEG) |
| `text` | string | Verified OCR text found in the image |
| `caption` | string | General visual description generated by BLIP-2 |
| `prompt` | string | Training prompt embedding the OCR text verbatim |
| `ocr_confidence` | float | EasyOCR confidence score (0–100) |
| `ocr_backend` | string | OCR engine used (`easyocr`) |
| `caption_model` | string | Captioning model used (`blip2` or `blip`) |
| `source` | string | AnyWord-3M subset of origin |
| `sharpness` | float | Laplacian variance of the image |
| `brightness` | float | Mean pixel brightness |
| `contrast` | float | Pixel standard deviation |
| `resolution_w` / `resolution_h` | int | Image dimensions in pixels |
| `text_length` | int | Character count of the OCR text |
| `word_count` | int | Word count of the OCR text |
| `phrase_reconstructed` | bool | Whether the full phrase was expanded beyond the bounding box |

### Sample record

```json
{
  "image": "<PIL.Image>",
  "text": "OPEN",
  "caption": "A storefront with a neon sign above the door.",
  "prompt": "A storefront with a neon sign above the door, with the text \"OPEN\" clearly visible",
  "ocr_confidence": 87.5,
  "source": "AnyWord-3M/laion",
  "sharpness": 142.3,
  "resolution_w": 512,
  "resolution_h": 384
}
```

---

## Usage

```python
from datasets import load_dataset

ds = load_dataset("EvitFan/SmolTextRender")

# Access a training sample
sample = ds["train"][0]
print(sample["prompt"])
sample["image"].show()
```

For fine-tuning with the prompt field:

```python
# Reuses `ds` from the snippet above
for sample in ds["train"]:
    image = sample["image"]    # PIL image
    prompt = sample["prompt"]  # text-conditioned training caption
    text = sample["text"]      # ground-truth OCR string
```

---

## Creation Pipeline

Images are drawn from AnyWord-3M (streamed) and pass through the following stages:

```
AnyWord-3M stream
      │
      ▼
1. Annotation filtering  → valid, short, English text regions only
      │
      ▼
2. Image quality gate    → resolution ≥ 256 px, sharpness ≥ 80,
                           brightness 30–230, contrast ≥ 20
      │
      ▼
3. EasyOCR verify        → confirm annotated text is readable (conf ≥ 0.40)
      │
      ▼
4. EasyOCR reconstruct   → expand to the full visible phrase
      │
      ▼
5. BLIP-2 caption        → general visual description
      │
      ▼
6. Prompt construction   → natural sentence with OCR text in quotes
      │
      ▼
7. Split & save          → 80 % train / 10 % val / 10 % test
```

---

## Source Subsets

| Subset | Description |
|--------|-------------|
| `laion` | Web-crawled natural images |
| `OCR_COCO_Text` | COCO scene text |
| `OCR_mlt2019` | Multi-language (English filtered) |
| `OCR_Art` | Artistic / designed text |

---

## Citation & Project

This dataset was produced as part of the **Efficient Image Generation** project at ENSTA Paris. Full methodology, training experiments, and inference benchmarks are documented in the [project report](https://drive.google.com/file/d/1ay4-cBOSt4LbLhwgQ0gBykda1Bu0HUXY/view?usp=drive_link).

---

## License

Released under the **MIT License**: free to use, modify, and distribute without restriction. Note that the AnyWord-3M source dataset and the BLIP-2 model are subject to their own respective licenses on Hugging Face.
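
---

## Example: re-checking the quality gate

The stage 2 thresholds listed under Creation Pipeline can be re-checked against the metric fields stored with each record. The following is a minimal sketch, not the project's pipeline code: `passes_quality_gate` is a hypothetical helper, and it assumes the resolution limit applies to the shorter image side. The stored `resolution_w` / `resolution_h` may also describe the resized 512 px image rather than the original, so treat the resolution check as illustrative.

```python
# Thresholds copied from stage 2 of the Creation Pipeline.
MIN_SIDE_PX = 256
MIN_SHARPNESS = 80.0
BRIGHTNESS_RANGE = (30.0, 230.0)
MIN_CONTRAST = 20.0


def passes_quality_gate(record: dict) -> bool:
    """Hypothetical re-check of the stage 2 gate using the stored metric fields."""
    # Assumption: the resolution limit applies to the shorter image side.
    shorter_side = min(record["resolution_w"], record["resolution_h"])
    low, high = BRIGHTNESS_RANGE
    return (
        shorter_side >= MIN_SIDE_PX
        and record["sharpness"] >= MIN_SHARPNESS
        and low <= record["brightness"] <= high
        and record["contrast"] >= MIN_CONTRAST
    )


# Illustrative values only; brightness and contrast here are made up.
good = {"resolution_w": 512, "resolution_h": 384, "sharpness": 142.3,
        "brightness": 120.0, "contrast": 45.0}
blurry = dict(good, sharpness=35.0)

print(passes_quality_gate(good))
print(passes_quality_gate(blurry))
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter` to select a stricter subset, for example by raising `MIN_SHARPNESS`.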
