Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts with Corresponding Ground Truth Annotations
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/15009379
Description:
This dataset provides a benchmark for Tamil Optical Character Recognition (OCR), covering both handwritten-style (Hangual) and printed Tamil text. Each high-quality ground truth (GT) text file is paired with a corresponding TIFF image, which makes the dataset suitable for training and evaluating OCR models, including Tesseract and deep learning-based recognizers, and for broader AI research.
Dataset Highlights
Total Size: 15 GB (a sample of the full 60 GB dataset)
Total Pairs: Approximately 1,903,284 text-image pairs (105,738 text lines × 18 fonts)
Handwritten Fonts (9 Unicode Fonts):
Aazhi, Gnani, Hemalatha, Indumathi, Kalayarasi, Siva_01, Siva_02, Sudeeptha, Yogeshwaran
Printed Fonts (9 Unicode Fonts):
AnekTamil, Arima, KarlaTamilInclined, TAU-Barathi, TAU-Kambar, TAU-Marutham, TAU-Mullai, TAU-Neythal, TAU-Valluvar
Data Source:
The text corpus (GT text files) is curated from Wikipedia and Wikisource, ensuring linguistic diversity.
The fonts are publicly available Unicode Tamil fonts, sourced from Google Fonts and Tamil Virtual University.
File Structure
Tamil_OCR_Dataset/
├── Hangual_Fonts/
│   ├── Aazhi/
│   │   ├── gt/
│   │   │   ├── 00001.gt.txt
│   │   │   ├── 00002.gt.txt
│   │   ├── images/
│   │   │   ├── 00001.tiff
│   │   │   ├── 00002.tiff
│   ├── Gnani/
│   ├── ...
├── Printed_Fonts/
│   ├── AnekTamil/
│   ├── HindMadurai/
│   ├── ...
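The layout above suggests that each GT file and its image share a numeric stem (00001.gt.txt and 00001.tiff). The following is a minimal sketch of how the pairs might be enumerated in Python. It assumes every font folder contains `gt/` and `images/` subfolders as shown for Aazhi, and that GT files are UTF-8 encoded; the helper names `iter_pairs` and `iter_dataset` are hypothetical and not part of the dataset.

```python
from pathlib import Path
from typing import Iterator, Tuple

# Assumption: GT files use the ".gt.txt" suffix shown in the file tree.
GT_SUFFIX = ".gt.txt"


def iter_pairs(font_dir: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, ground_truth_text) for one font folder."""
    gt_dir = font_dir / "gt"
    image_dir = font_dir / "images"
    if not gt_dir.is_dir():
        return
    for gt_path in sorted(gt_dir.glob("*" + GT_SUFFIX)):
        # "00001.gt.txt" -> "00001"
        line_id = gt_path.name[: -len(GT_SUFFIX)]
        image_path = image_dir / (line_id + ".tiff")
        if not image_path.is_file():
            # Skip GT files without a matching image instead of failing.
            continue
        text = gt_path.read_text(encoding="utf-8").strip()
        yield image_path, text


def iter_dataset(root: Path) -> Iterator[Tuple[str, str, Path, str]]:
    """Yield (group, font, image_path, text) for every pair under root."""
    for group_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for font_dir in sorted(p for p in group_dir.iterdir() if p.is_dir()):
            for image_path, text in iter_pairs(font_dir):
                yield group_dir.name, font_dir.name, image_path, text
```

Calling `iter_dataset(Path("Tamil_OCR_Dataset"))` on an extracted copy would yield one tuple per text-image pair. Unmatched GT files are skipped silently here; a stricter loader might prefer to raise an error.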
Cite this work
@dataset{tamilocr_dataset_2025,
  author    = {Syedkhaleel Jageer},
  title     = {Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts with Corresponding Ground Truth Annotations},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15009380},
  url       = {https://doi.org/10.5281/zenodo.15009380}
}
Created:
2025-03-22



