AksharaOCR - Real-World Image-Based Sinhala and Sinhala-English mixed OCR Datasets

Name: AksharaOCR - Real-World Image-Based Sinhala and Sinhala-English mixed OCR Datasets
Creator: Ravindu Marasinghe; Ishan Warshamana; Praguna Chandrasekara; Thanuja Ambegoda
License: No description available

Source: IEEE (indexed 2026-04-17)

Download link:

https://ieee-dataport.org/documents/aksharaocr-real-world-image-based-sinhala-and-sinhala-english-mixed-ocr-datasets

Description:

We present the first publicly available, real-image-based OCR dataset for the Sinhala language, developed to support research in Optical Character Recognition (OCR) and multilingual document processing in low-resource settings. The dataset includes over 24,000 annotated text lines and 127,000 words extracted from authentic printed documents, covering both Sinhala-only and Sinhala-English code-mixed text. Unlike widely available synthetic datasets, this corpus captures the complexities of real-world documents, including noise, distortions, and varied lighting conditions. A custom annotation tool was used to generate high-quality line-level annotations under close human supervision. This dataset addresses a critical gap in Sinhala-language OCR resources, offering a robust benchmark for developing more accurate and adaptable OCR systems. (We are currently releasing a portion of the dataset; the full dataset will be made publicly available following the publication of the corresponding research paper.)

Provided by:

Ravindu Marasinghe; Ishan Warshamana; Praguna Chandrasekara; Thanuja Ambegoda
