Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts with Corresponding Ground Truth Annotations

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://zenodo.org/record/15009379

Resource description:

This dataset provides a benchmark for Tamil Optical Character Recognition (OCR), covering both handwritten (Hangual) and printed Tamil text. It pairs high-quality ground truth (GT) text files with corresponding TIFF images, which makes it suitable for training and evaluating OCR models, particularly Tesseract and deep learning-based recognizers, and for AI research.

Dataset Highlights

Total Size: 15 GB (sample from the full 60 GB dataset)
Total Pairs: approximately 1,903,284 text-image pairs (105,738 text lines × 18 fonts)
Handwritten Fonts (9 Unicode fonts): Aazhi, Gnani, Hemalatha, Indumathi, Kalayarasi, Siva_01, Siva_02, Sudeeptha, Yogeshwaran
Printed Fonts (9 Unicode fonts): AnekTamil, Arima, KarlaTamilInclined, TAU-Barathi, TAU-Kambar, TAU-Marutham, TAU-Mullai, TAU-Neythal, TAU-Valluvar

Data Source

The text corpus (GT text files) is curated from Wikipedia and Wikisource, ensuring linguistic diversity. The fonts are publicly available Unicode Tamil fonts, sourced from Google Fonts and Tamil Virtual University.

File Structure

Tamil_OCR_Dataset/
├── Hangual_Fonts/
│   ├── Aazhi/
│   │   ├── gt/
│   │   │   ├── 00001.gt.txt
│   │   │   ├── 00002.gt.txt
│   │   ├── images/
│   │   │   ├── 00001.tiff
│   │   │   ├── 00002.tiff
│   ├── Gnani/
│   ├── ...
├── Printed_Fonts/
│   ├── AnekTamil/
│   ├── HindMadurai/
│   ├── ...

Cite this work

@dataset{tamilocr_dataset_2025,
  author    = {Syedkhaleel Jageer},
  title     = {Synthetic OCR Dataset: 105,738 Tamil Text Lines Rendered in 18 Diverse Fonts with Corresponding Ground Truth Annotations},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15009380},
  url       = {https://doi.org/10.5281/zenodo.15009380}
}
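
The gt/ and images/ layout described above suggests that each ground truth file can be matched to its image by a shared numeric stem (00001.gt.txt with 00001.tiff). The following is a minimal Python sketch of that pairing, not part of the dataset itself. The helper names iter_pairs and read_gt are hypothetical, and it assumes that the GT files are UTF-8 encoded and that every font folder follows the gt/ and images/ pattern shown for Aazhi.

```python
from pathlib import Path
from typing import Iterator, Tuple

GT_SUFFIX = ".gt.txt"  # suffix of the ground truth files in the listing


def iter_pairs(font_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (image_path, gt_path) for each line that has both files.

    Assumes the layout <font_dir>/gt/NNNNN.gt.txt and
    <font_dir>/images/NNNNN.tiff; GT files without an image are skipped.
    """
    gt_dir = font_dir / "gt"
    img_dir = font_dir / "images"
    for gt_path in sorted(gt_dir.glob(f"*{GT_SUFFIX}")):
        stem = gt_path.name[: -len(GT_SUFFIX)]  # "00001.gt.txt" -> "00001"
        img_path = img_dir / f"{stem}.tiff"
        if img_path.is_file():
            yield img_path, gt_path


def read_gt(gt_path: Path) -> str:
    """Read one ground truth line (UTF-8 is an assumption)."""
    return gt_path.read_text(encoding="utf-8").strip()
```

For example, list(iter_pairs(Path("Tamil_OCR_Dataset/Hangual_Fonts/Aazhi"))) would collect the pairs for a single font, assuming the archive has been extracted under that path.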

Created:

2025-03-22