AksharaOCR - Real-World Image-Based Sinhala and Sinhala-English mixed OCR Datasets
Source: IEEE; indexed 2026-04-17
Download link:
https://ieee-dataport.org/documents/aksharaocr-real-world-image-based-sinhala-and-sinhala-english-mixed-ocr-datasets
Description:
We present the first publicly available, real-image-based OCR dataset for the Sinhala language, developed to support research in Optical Character Recognition (OCR) and multilingual document processing in low-resource settings. The dataset includes over 24,000 annotated text lines and 127,000 words extracted from authentic printed documents, covering both Sinhala-only and Sinhala-English code-mixed text. Unlike widely available synthetic datasets, this corpus captures the complexities of real-world documents, including noise, distortions, and varied lighting conditions. A custom annotation tool was used to generate high-quality line-level annotations under close human supervision. This dataset addresses a critical gap in Sinhala-language OCR resources, offering a robust benchmark for developing more accurate and adaptable OCR systems. (We are currently releasing a portion of the dataset; the full dataset will be made publicly available following the publication of the corresponding research paper.)
Provided by:
Ravindu Marasinghe; Ishan Warshamana; Praguna Chandrasekara; Thanuja Ambegoda



