LocalDoc/azerbaijani-htr-synthetic

Name: LocalDoc/azerbaijani-htr-synthetic
Creator: LocalDoc
Published: 2026-04-25 14:35:01
License: CC BY 4.0 (per the dataset card; the listing itself gives no description)

Source: Hugging Face. Updated 2026-04-25. Indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/LocalDoc/azerbaijani-htr-synthetic

Description:

---
language:
- az
license: cc-by-4.0
size_categories:
- 1M<n<10M
task_categories:
- image-to-text
tags:
- ocr
- htr
- handwritten-text-recognition
- azerbaijani
- synthetic
pretty_name: Azerbaijani Synthetic Handwritten OCR Dataset
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: font
    dtype: string
  - name: profile
    dtype: string
  splits:
  - name: train
    num_bytes: 33441242062
    num_examples: 1402306
  - name: validation
    num_bytes: 894779841
    num_examples: 37160
  - name: test
    num_bytes: 886726559
    num_examples: 37282
  download_size: 31586595391
  dataset_size: 35222748462
---

# Azerbaijani Synthetic Handwritten OCR Dataset

A large-scale synthetic dataset for training handwritten text recognition (HTR) models on Azerbaijani Latin script. It was generated with a procedural pipeline that combines real-world handwriting fonts with realistic scan-style augmentations.

This dataset addresses the lack of publicly available handwriting OCR data for Azerbaijani, a low-resource language for which no IAM-equivalent corpus exists.

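As a quick sanity check, the split counts in the `dataset_info` metadata above can be turned into a total and per-split percentages. The sketch below hard-codes the three `num_examples` values from that metadata; it downloads nothing, and the variable names are illustrative only.

```python
# Counts copied from the `dataset_info.splits` metadata of the dataset card.
split_counts = {"train": 1_402_306, "validation": 37_160, "test": 37_282}

# Total number of (image, text) pairs across all splits.
total = sum(split_counts.values())

# Share of each split, in percent, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in split_counts.items()}

print(total)   # 1476748
print(shares)
```

The total is slightly below the rounded "~1.5M" figure used elsewhere in the card, and the shares come out close to a 95 / 2.5 / 2.5 split.
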
## Dataset Statistics

| Property | Value |
|---|---|
| Total samples | ~1,500,000 line images (1,476,748 per the split metadata) |
| Language | Azerbaijani (Latin script) |
| Image format | JPEG, line-level crops |
| Average image size | ~800 × 70 pixels |
| Total size on disk | ~30 GB |
| Splits | train (95%), validation (2.5%), test (2.5%) |

### Per-sample fields

| Field | Type | Description |
|---|---|---|
| `image` | Image | Line-level RGB image of synthesized handwriting |
| `text` | string | Ground truth transcription (NFC-normalized) |
| `font` | string | Font filename used for rendering (for analysis) |
| `profile` | string | Augmentation profile applied (`mixed`, `school`, `office`, `archival`) |

## How the dataset was built

The pipeline takes plain text from Azerbaijani corpora, renders each line using a randomly selected handwriting font, applies realistic scan-style augmentations, and saves the result as an (image, text) pair.

### Step 1: Text corpus assembly

Two text sources were combined to balance natural prose with document-specific patterns rarely seen in standard corpora:

**Source A: Parallel corpus (~95.5% of samples).** The Azerbaijani side of [LocalDoc/azerbaijani-english-parallel-corpus](https://huggingface.co/datasets/LocalDoc/azerbaijani-english-parallel-corpus), which provides ~3.9M unique sentences after deduplication. Sentences shorter than 10 characters or longer than 250 characters were filtered out. Unicode was NFC-normalized.

**Source B: Specialized strings (~4.5% of samples).** Programmatically generated using the [az-data-generator](https://pypi.org/project/az-data-generator/) library, producing realistic short strings that are common in handwritten documents but absent from prose corpora:

| Category | Examples |
|---|---|
| Dates | `15.06.1985`, `12 aprel 2025-ci il`, `1985-ci ildə` |
| Full names | `Eldar Məmmədov`, `E. Əliyev`, `Qənirə İmanzadə` |
| Phone numbers | `+994 50 143 59 89`, `(055) 211-58-69`, `010-914-16-12` |
| Addresses | `Bakı şəhəri, Nizami küçəsi, bina 45, mənzil 12, AZ 1005` |
| Geographic names | `Zaqatala`, `Yasamal`, `Əhməd Cəmil küçəsi` |
| Document IDs | `AA 7146179`, `68-CH-088`, `AZE 1234567` |
| Amounts | `1 234,56 AZN`, `37,4%`, `7766 €`, `25 kq` |
| Form fields | `Doğum tarixi: 15.06.1985`, `Tel: +994 70 156 91 54` |
| Signatures | `Hazırladı: N. Əliyev, 12.04.2025` |

These specialized strings raise digit density (~23% vs ~3% in plain prose) and improve coverage of out-of-vocabulary tokens such as proper names and addresses, which are common failure modes of OCR models trained on prose alone.

The two sources were merged and shuffled deterministically (seed 42) into a combined corpus of ~4.1M unique strings.

### Step 2: Font collection and validation

A pool of handwriting fonts was collected from Google Fonts (Latin Extended subset) and filtered through a custom validator that checks:

1. **Character coverage**: every font must support all Azerbaijani-specific characters: `ə Ə ğ Ğ ş Ş ı İ ü Ü ö Ö ç Ç`. The schwa `ə` is the most commonly missing glyph in handwriting fonts; ~50% of candidate fonts fail this check.
2. **Visual suitability**: decorative calligraphic scripts (Qwitcher Grypen, Great Vibes, Allura, Birthstone, etc.) were excluded by a name-based blacklist. These fonts have thin connected strokes that produce unreadable "worm-like" output incompatible with realistic OCR training.
3. **Variant filtering**: Bold variants of handwriting families were excluded because they tend to "blob out" under ink-thickening augmentations, producing unreadable "sausage-like" samples.

The final font set contains 31 validated handwriting fonts representing diverse writing styles (marker, pen, brush, school-cursive).

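The character-coverage check in Step 2 can be sketched as a lookup in a font's Unicode character map. This is a minimal sketch under stated assumptions, not the author's validator: the card does not say which library was used, so fontTools (`TTFont.getBestCmap()`) is an assumption here, and `missing_chars` and `font_missing_chars` are hypothetical helper names. The name-based blacklist and Bold-variant filter are not covered.

```python
# Characters the card lists as required for Azerbaijani Latin script.
REQUIRED_CHARS = "əƏğĞşŞıİüÜöÖçÇ"


def missing_chars(supported_codepoints, required=REQUIRED_CHARS):
    """Return the required characters whose code points the font does not map."""
    return [ch for ch in required if ord(ch) not in supported_codepoints]


def font_missing_chars(font_path):
    """Load a font file and report which required characters it lacks.

    Assumption: fontTools is installed (pip install fonttools).
    """
    from fontTools.ttLib import TTFont  # imported lazily; third-party

    with TTFont(font_path) as font:
        # getBestCmap() maps code points to glyph names, or is None
        # when the font has no Unicode character map.
        cmap = font.getBestCmap() or {}
    return missing_chars(cmap.keys())


# Simulated font that covers only Latin-1 (U+0020..U+00FF): it has the
# umlauts and cedillas but lacks the schwa, breve, and dotted/dotless i.
latin1_only = set(range(0x20, 0x100))
print(missing_chars(latin1_only))
```

A font would pass this check only when the returned list is empty; for a real file, a call such as `font_missing_chars("SomeFont-Regular.ttf")` (hypothetical path) would do the same lookup against the font's actual character map.
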
### Step 3: Image generation

Each line image was generated as follows:

**Word-level rendering with organic deformations.** Instead of rendering the full line as a single text element, each word is rendered separately with:

- Per-word vertical jitter (±6 pixels): words sit at slightly different baselines
- Per-word rotation (±1.2°): each word is independently angled
- Per-word elastic deformation (55% probability): slight non-linear distortion
- Negative inter-word spacing: soft contact between adjacent words

This produces the natural "wobble" of human handwriting that uniform line rendering misses.

**Baseline waviness** uses Perlin noise rather than a sine wave, giving aperiodic baseline drift that better matches real handwriting.

**Ink effects** include random dilation (30% probability), Gaussian blur, ink color variation (black, blue, dark blue, sepia for aged backgrounds), occasional ink blobs, and double-stroke artifacts.

**Backgrounds** include white, cream-colored, lined notebook, grid, and aged/yellowed paper textures, weighted according to the chosen profile.

**Geometric and photometric augmentations** include:

- Rotation up to ±4°
- Light perspective warping
- Random shadows from page folds (25% probability)
- Bleed-through from the reverse side (15% probability for the archival profile)
- Gaussian noise
- JPEG recompression at quality 55–95

### Step 4: Profile selection

Each sample is generated under one of four profiles, which control the distribution of background types and degradations:

- **`mixed`**: balanced default, used for the majority of samples
- **`school`**: emphasizes lined paper, simulates student notebooks
- **`office`**: clean white backgrounds, simulates official forms
- **`archival`**: yellowed paper, bleed-through, heavier noise

For this release, all samples use the `mixed` profile.

### Step 5: Output

Images are saved as JPEG (quality 90) to keep file sizes manageable. Labels are stored alongside in JSONL format with one record per image. The full dataset comprises ~1.5M (image, text) pairs.

## Important caveats

**This is synthetic data, not real handwriting.** Models trained exclusively on this dataset will achieve high accuracy on similar synthetic data, but performance on real-world scans depends on:

1. How well the augmentation profile matches the target distribution (notebook scans vs. archival documents vs. forms)
2. The diversity of handwriting styles represented by the font pool
3. Domain shift between synthetic and real handwriting

**Expected baseline performance** when fine-tuning a TrOCR-base model on this dataset:

- CER 1–3% on the held-out synthetic test set
- CER 15–25% on real handwritten documents (without fine-tuning on real data)

To reduce the gap, this dataset is best used as a pretraining stage, followed by fine-tuning on a smaller set of manually labeled real handwriting samples.

**The dataset includes some noisy samples** that survived the generation pipeline. The author chose to release the full unfiltered output; researchers who prefer cleaner data can apply post-hoc filtering on the `font` field (excluding decorative/calligraphic fonts) or on image properties (ink density, aspect ratio, connected components).

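The CER figures quoted above are character error rates: the edit distance between prediction and ground truth, divided by the ground-truth length. The card does not state how its numbers were computed, so the following is only a minimal sketch, assuming plain Levenshtein distance, NFC normalization of both strings (to match the `text` field), and corpus-level averaging. The function names `edit_distance` and `cer` are illustrative.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # delete a reference character
                            curr[j - 1] + 1,       # insert a hypothesis character
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edit distance over total reference length."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref = unicodedata.normalize("NFC", ref)
        hyp = unicodedata.normalize("NFC", hyp)
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    return total_edits / total_chars if total_chars else 0.0


# One substituted character (dotless "ı" read as "i") in an 11-character line.
print(cer(["Bakı şəhəri"], ["Baki şəhəri"]))
```

Confusions between `ı` and `i` or between `ə` and `e` count as full substitutions under this metric, so it is sensitive to exactly the Azerbaijani-specific characters the dataset targets.
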
## Usage

```python
from datasets import load_dataset

ds = load_dataset("LocalDoc/azerbaijani-htr-synthetic")
print(ds)
# DatasetDict({
#     train: Dataset({features: ['image', 'text', 'font', 'profile'], num_rows: 1402306}),
#     validation: Dataset({features: [...], num_rows: 37160}),
#     test: Dataset({features: [...], num_rows: 37282})
# })

# Access a sample
sample = ds["train"][0]
sample["image"].show()
print(sample["text"])
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{azerbaijani_htr_synthetic_2026,
  author    = {LocalDoc},
  title     = {Azerbaijani Synthetic Handwritten OCR Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/LocalDoc/azerbaijani-htr-synthetic}
}
```

## License

Released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/): free to use for research and commercial purposes with attribution.

The underlying text comes from a CC-licensed parallel corpus and from programmatically generated synthetic strings. Fonts used for rendering are licensed under the SIL Open Font License (OFL).

## Acknowledgments

- Text corpus: [LocalDoc/azerbaijani-english-parallel-corpus](https://huggingface.co/datasets/LocalDoc/azerbaijani-english-parallel-corpus)
- Specialized data generation: [az-data-generator](https://pypi.org/project/az-data-generator/)
- Fonts: Google Fonts (SIL OFL)

Provider:

LocalDoc
