PuristanLabs1/urdu-ocr-1M
Source: Hugging Face. Updated 2026-02-01; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/PuristanLabs1/urdu-ocr-1M
Resource summary:
---
configs:
- config_name: nastaliq
  default: true
  data_files:
  - split: train
    path: nastaliq/train-*.parquet
  - split: val
    path: nastaliq/val-*.parquet
- config_name: naskh
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*-of-00001.parquet
license: mit
task_categories:
- image-to-text
language:
- ur
tags:
- ocr
- nastaliq
- urdu
size_categories:
- 1M-10M
---
# Urdu OCR Dataset (1.5 Million Samples)
## Dataset Summary
This is a large-scale synthetic dataset for Urdu Optical Character Recognition (OCR). It combines a **Nastaliq** collection, the primary focus, with a larger **Naskh** base set:
1. **Nastaliq (Primary)**: 499,845 samples rendered with authentic Jameel Noori Nastaliq ligatures using a custom Chromium-based rendering pipeline.
2. **Naskh**: 1,000,160 samples in standard Urdu fonts for baseline OCR tasks.

Together the two sets total about **1.5 million** samples (1,500,005), provided as sharded Parquet files for efficient streaming.
## Dataset Structure
| Style | Samples | Description |
| :--- | :--- | :--- |
| **Nastaliq** | 499,845 | Authentic cascading ligatures + aggressive scan augmentation. |
| **Naskh** | 1,000,160 | Standard horizontal rendering. |
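
As a quick arithmetic check on the figures in this card, the short sketch below sums the two stated sample counts and prints each style's share. The dictionary keys are just the configuration names; nothing here reads the dataset itself.

```python
# Sample counts as stated in this card (not read from the Parquet files).
counts = {"nastaliq": 499_845, "naskh": 1_000_160}

total = sum(counts.values())  # 1,500,005
for style, n in counts.items():
    # Print the count and its share of the combined total.
    print(f"{style:<9}{n:>10,}  {n / total:6.1%}")
print(f"{'total':<9}{total:>10,}")
```

Naskh therefore makes up roughly two thirds of the samples, which may matter when mixing the two configurations during training.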
### Schema
- `image`: The rendered Urdu text image (normalized to 64px height).
- `text`: The ground truth Urdu string.
- `filename`: Reference to source and variant type.
- `style`: "nastaliq" or "naskh".
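
A minimal sketch of how a record could be checked against the schema above. The helper name `check_record` and the constants are hypothetical, and the check assumes `image` is an object exposing a `.height` attribute (as PIL images do) and that `text` is a plain string.

```python
EXPECTED_FIELDS = ("image", "text", "filename", "style")
VALID_STYLES = {"nastaliq", "naskh"}
EXPECTED_HEIGHT = 64  # pixels, per the schema above


def check_record(record):
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in record]
    if problems:
        return problems
    if record["style"] not in VALID_STYLES:
        problems.append(f"unexpected style: {record['style']!r}")
    if not record["text"].strip():
        problems.append("empty text")
    # PIL images expose .height; any object with that attribute works here.
    height = getattr(record["image"], "height", None)
    if height != EXPECTED_HEIGHT:
        problems.append(f"image height is {height}, expected {EXPECTED_HEIGHT}")
    return problems
```

Running this over the first few streamed samples is a cheap way to confirm that a loader sees the fields it expects before starting a long training job.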
## The Nastaliq Recipe (Phase 2)
To capture the complexity of Nastaliq script, we moved beyond standard image rendering:
### 1. High-Fidelity Rendering
Unlike standard PIL-based rendering, which often breaks Urdu ligatures, we used a **headless Chromium engine** (via Playwright) so that text is shaped by the browser's own layout engine, preserving Jameel Noori Nastaliq kerning and vertical stacking.
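
The general approach can be sketched as follows. This is not the pipeline used to build the dataset: the function names, the CSS, the 48px font size, and the font family string `Jameel Noori Nastaleeq` are assumptions, and the font must already be installed on the machine for Chromium to pick it up.

```python
import html


def build_line_html(text, font_family="Jameel Noori Nastaleeq", font_px=48):
    """Wrap one Urdu line in a minimal right-to-left HTML page."""
    return (
        '<!DOCTYPE html><html lang="ur" dir="rtl"><head><meta charset="utf-8">'
        "<style>"
        "body { margin: 0; background: #fff; }"
        f"#line {{ display: inline-block; padding: 8px; white-space: nowrap; "
        f"font-family: '{font_family}'; font-size: {font_px}px; color: #000; }}"
        "</style></head>"
        f'<body><span id="line">{html.escape(text)}</span></body></html>'
    )


def render_line(text, out_path):
    """Render one line to a PNG with headless Chromium (requires Playwright)."""
    # Imported lazily so build_line_html can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        page.set_content(build_line_html(text))
        # Screenshot only the text element so the image is cropped to the line.
        page.locator("#line").screenshot(path=out_path)
        browser.close()
```

Launching a browser per line is slow; a real pipeline would more likely keep one browser open and reuse the page across many lines.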
### 2. Manual Quality Control
A pilot batch was manually reviewed by native speakers to identify "gibberish" or clipped samples. The review yielded 31 specific low-quality patterns, which were purged from the final ~500k set.
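
A pattern-based purge of this kind might look like the sketch below. The card does not list the 31 patterns, so the three regular expressions here are purely hypothetical placeholders, as is the helper name `keep_sample`.

```python
import re

# Hypothetical examples only; the card does not list the 31 real patterns.
REJECT_PATTERNS = [
    re.compile(r"[A-Za-z]{4,}"),  # long runs of Latin letters
    re.compile(r"(.)\1{5,}"),     # one character repeated six or more times
    re.compile(r"^\W+$"),         # punctuation or symbols only
]


def keep_sample(text):
    """Return True if the text matches none of the reject patterns."""
    return not any(pattern.search(text) for pattern in REJECT_PATTERNS)


lines = ["یہ ایک جملہ ہے", "loremipsum", "!!!"]
clean = [line for line in lines if keep_sample(line)]
```

Text-level filters like these cannot catch clipped renders, which would need a check on the image itself.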
### 3. Aggressive 1:5 Augmentation
Each of the ~100k unique text lines yields five images, apparently the clean render plus **4 unique "Destruction" variants**, which accounts for the ~500k total. The variants are designed to simulate real-world scanned documents:
* **V1: Dusty Scan**: Grayscale grain + Pepper noise.
* **V2: Thin/Compressed**: Morphological erosion + JPEG compression.
* **V3: Heavy Ink**: Morphological dilation (ink bleed) + Gaussian softening.
* **V4: Snowy Field**: Coarse grain + Salt noise (white specks).
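
Two of the building blocks named above, salt/pepper noise and morphological thinning/thickening of strokes, can be illustrated in pure Python on a grayscale image stored as a list of rows. This is a simplified sketch rather than the dataset's actual augmentation code: it assumes dark ink on a light background, the 2% noise amount and 3x3 window are arbitrary choices, and grain, JPEG compression, and Gaussian softening are left out.

```python
import random

WHITE, BLACK = 255, 0


def add_impulse_noise(img, value, amount, rng):
    """Set roughly `amount` of the pixels to `value` (0 = pepper, 255 = salt)."""
    return [[value if rng.random() < amount else px for px in row] for row in img]


def _filter3x3(img, pick):
    """Apply `pick` (min or max) over each pixel's 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def thicken_ink(img):
    """Dark strokes grow: a 3x3 minimum filter (dilation of the ink)."""
    return _filter3x3(img, min)


def thin_ink(img):
    """Dark strokes shrink: a 3x3 maximum filter (erosion of the ink)."""
    return _filter3x3(img, max)


def make_variants(img, seed=0):
    """Build four simplified variants, keyed by hypothetical names."""
    rng = random.Random(seed)
    return {
        "dusty_scan": add_impulse_noise(img, BLACK, 0.02, rng),
        "thin": thin_ink(img),
        "heavy_ink": thicken_ink(img),
        "snowy_field": add_impulse_noise(img, WHITE, 0.02, rng),
    }
```

In practice a library such as OpenCV or Pillow would do the same operations far faster; the pure-Python version is only meant to make the operations explicit.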
## Usage
```python
from datasets import load_dataset

# Load the Nastaliq configuration in streaming mode (no full download needed).
dataset = load_dataset("PuristanLabs1/urdu-ocr-1M", "nastaliq", streaming=True)

# Pull the first training sample and display its image.
sample = next(iter(dataset["train"]))
sample["image"].show()
```
## Credits
Produced by **Puristan Labs**. The Nastaliq pipeline combines Chromium-based rendering with character-preserving morphological augmentations.
Provider:
PuristanLabs1



