RAVI: Synthetic Urdu Text Image Dataset for OCR
NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://data.mendeley.com/datasets/mhy5vxnths
Description:
The RAVI dataset is a synthetic image dataset designed to support the development and training of Urdu OCR (Optical Character Recognition) models. It consists of 99,000 images of 256x256 pixels, each containing a single Urdu word rendered in black text on a white background. Each image is labeled with its corresponding Urdu word, enabling both supervised training and evaluation of word-level OCR systems. The text is rendered in the “Jameel Noori Nastaleeq” font, a widely used Nastaliq-style Urdu font, at font size 40.
The dataset is organized into subfolders corresponding to the Urdu alphabet, allowing for easier categorization, retrieval, and model evaluation based on character-specific performance.
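As an illustration of how the per-letter folder layout could be consumed, the following is a minimal sketch that walks the extracted archive and collects (image path, label) pairs for each subfolder. Several details are assumptions and not confirmed by the dataset description: the function name index_dataset is hypothetical, the images are assumed to be PNG or JPEG files, and each label is assumed to be the file name without its extension. If the labels are shipped in a separate annotation file, the label lookup would need to change.

```python
from pathlib import Path

# Assumed image file types; the dataset description does not name the format.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def index_dataset(root):
    """Map each subfolder name to a sorted list of (image_path, label) pairs.

    Assumption: the label of an image is its file name without the extension.
    """
    root = Path(root)
    index = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        samples = [
            (image, image.stem)
            for image in sorted(folder.iterdir())
            if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES
        ]
        index[folder.name] = samples
    return index
```

Calling index_dataset on the extracted dataset directory returns a dictionary keyed by subfolder name, which makes it straightforward to count samples per letter or to evaluate a model on one letter group at a time.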
This dataset is particularly valuable for researchers and developers working on CNN-based OCR systems for Urdu, both for printed text recognition and as a starting point for future handwritten text recognition. It can serve as a benchmark for word-level OCR models, sequence prediction architectures, and other deep learning applications in low-resource languages.
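To make the benchmarking use concrete, here is a minimal sketch of two metrics commonly reported for word-level OCR: exact-match word accuracy and character error rate (CER, total edit distance divided by total reference length). The dataset description does not prescribe any metric, so this choice is an assumption, as are the function names edit_distance and evaluate. Characters are counted as Unicode code points, which may differ from what a reader perceives as one letter when combining marks are present.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def evaluate(pairs):
    """Return (word_accuracy, cer) for an iterable of (reference, prediction)."""
    pairs = list(pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if not pairs or total_chars == 0:
        raise ValueError("need at least one non-empty reference word")
    exact = sum(ref == pred for ref, pred in pairs)
    edits = sum(edit_distance(ref, pred) for ref, pred in pairs)
    return exact / len(pairs), edits / total_chars
```

For example, evaluate([("کتاب", "کتاب"), ("قلم", "قلب")]) scores one of two words as exactly correct and one substituted character out of seven reference characters.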
Key Features:
99,000 annotated images
Image resolution: 256x256 pixels
Black Urdu text on white background
Font: Jameel Noori Nastaleeq, size 40
Organized alphabetically by Urdu letters
Suitable for training, validation, and benchmarking of OCR systems
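The dataset description does not mention predefined splits, so a user would likely need to create their own. The sketch below shows one possible approach, a per-folder (stratified) split, so that every letter group is represented in both the training and validation sets. The function name stratified_split, the 10% validation fraction, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def stratified_split(index, val_fraction=0.1, seed=0):
    """Split samples into (train, val) lists, folder by folder.

    `index` maps a folder name to a list of samples. The validation
    fraction and seed are illustrative defaults, not dataset settings.
    """
    rng = random.Random(seed)  # local generator keeps the split reproducible
    train, val = [], []
    for name in sorted(index):
        items = list(index[name])
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```

Splitting within each folder, instead of shuffling all 99,000 samples together, keeps the proportion of each letter group roughly equal across the two sets.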
Created:
2025-06-16



