PuristanLabs1/urdu-ocr-1M

Name: PuristanLabs1/urdu-ocr-1M
Creator: PuristanLabs1
Published: 2026-02-01 08:45:40
License: No description available on the listing page (the dataset card metadata declares `mit`)

Source: Hugging Face. Updated 2026-02-01; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/PuristanLabs1/urdu-ocr-1M


Description:

---
configs:
- config_name: nastaliq
  default: true
  data_files:
  - split: train
    path: nastaliq/train-*.parquet
  - split: val
    path: nastaliq/val-*.parquet
- config_name: naskh
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*-of-00001.parquet
license: mit
task_categories:
- image-to-text
language:
- ur
tags:
- ocr
- nastaliq
- urdu
size_categories:
- 1M-10M
---

# Urdu OCR Dataset (1.5 Million Samples)

## Dataset Summary

This is a large-scale synthetic dataset for Urdu Optical Character Recognition (OCR). It combines a **Nastaliq** collection with a larger **Naskh** base.

1. **Nastaliq (Primary)**: 499,845 samples rendered with authentic Jameel Noori Nastaliq ligatures using a custom Chromium-based rendering pipeline.
2. **Naskh**: 1,000,160 samples in standard Urdu fonts for baseline OCR tasks.

Together the two collections total about **1.5 million** samples (1,500,005). The dataset is provided as sharded Parquet files for efficient streaming.

## Dataset Structure

| Style | Samples | Description |
| :--- | :--- | :--- |
| **Nastaliq** | 499,845 (approx. 500,000) | Authentic cascading ligatures + aggressive scan augmentation. |
| **Naskh** | 1,000,160 | Standard horizontal rendering. |

### Schema

- `image`: The rendered Urdu text image (normalized to 64px height).
- `text`: The ground truth Urdu string.
- `filename`: Reference to source and variant type.
- `style`: "nastaliq" or "naskh".

## The Nastaliq Recipe (Phase 2)

To capture the complexity of Nastaliq script, we moved beyond standard image rendering:

### 1. High-Fidelity Rendering

Standard PIL-based rendering often breaks Urdu ligatures. Instead, we used a **headless Chromium engine** (via Playwright) to render text with authentic Jameel Noori Nastaliq kerning and vertical stacking.

### 2. Manual Quality Control

A pilot batch was manually reviewed by native speakers to identify and remove "gibberish" or clipped samples. 31 specific low-quality patterns were identified and purged from the final 500k set.

### 3. Aggressive 1:5 Augmentation

Each of the ~100k unique text lines was used to generate **4 unique "Destruction" variants** designed to simulate real-world scanned documents:

* **V1: Dusty Scan**: Grayscale grain + pepper noise.
* **V2: Thin/Compressed**: Morphological erosion + JPEG compression.
* **V3: Heavy Ink**: Morphological dilation (ink bleed) + Gaussian softening.
* **V4: Snowy Field**: Coarse grain + salt noise (white specks).

## Usage

```python
from datasets import load_dataset

# Load the new Nastaliq configuration in streaming mode
dataset = load_dataset("PuristanLabs1/urdu-ocr-1M", "nastaliq", streaming=True)

# Fetch the first training sample and display its image
sample = next(iter(dataset["train"]))
sample["image"].show()
```

## Credits

Produced by **Puristan Labs**. The specialized Nastaliq pipeline was developed using Chromium-based rendering and character-preserving morphological augmentations.
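The four "Destruction" variants named in the Nastaliq recipe can be illustrated with a small, dependency-free sketch. This is not the authors' pipeline: the function names, noise rates, grain strengths, and 3x3 neighborhood are all assumptions chosen for illustration, and the JPEG compression and Gaussian softening steps are omitted. Images are modeled as lists of grayscale rows with dark ink (0) on light paper (255), so thinning the strokes means taking a local maximum and thickening them means taking a local minimum.

```python
import random

# Hypothetical image model: rows of grayscale values, 0 = ink, 255 = paper.
Image = list[list[int]]


def add_impulse_noise(img: Image, value: int, rate: float, rng: random.Random) -> Image:
    """Set roughly a fraction `rate` of pixels to `value` (0 = pepper, 255 = salt)."""
    return [[value if rng.random() < rate else px for px in row] for row in img]


def add_grain(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian grain and clamp the result to 0..255."""
    return [[min(255, max(0, round(px + rng.gauss(0, sigma)))) for px in row]
            for row in img]


def neighborhood(img: Image, reducer) -> Image:
    """Apply `reducer` (min or max) over each pixel's 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(reducer(window))
        out.append(row)
    return out


def thin_ink(img: Image) -> Image:
    # With dark ink on light paper, the local maximum erodes the strokes.
    return neighborhood(img, max)


def thicken_ink(img: Image) -> Image:
    # The local minimum dilates the strokes, imitating ink bleed.
    return neighborhood(img, min)


def make_variants(img: Image, seed: int = 0) -> dict[str, Image]:
    """Return one illustrative image per variant; all parameters are assumed."""
    rng = random.Random(seed)
    return {
        "v1_dusty_scan": add_impulse_noise(add_grain(img, 8, rng), 0, 0.02, rng),
        "v2_thin": thin_ink(img),
        "v3_heavy_ink": thicken_ink(img),
        "v4_snowy_field": add_impulse_noise(add_grain(img, 20, rng), 255, 0.02, rng),
    }


# Usage: a 5x7 white canvas with a 3-pixel-thick horizontal stroke.
clean = [[255] * 7 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 6):
        clean[y][x] = 0
variants = make_variants(clean)
```

Seeding a private `random.Random` instance keeps each variant reproducible for a given input line, which makes it easier to regenerate or audit individual samples.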

Provider:

PuristanLabs1
