SinOCR and SinFUND
Source: IEEE. Indexed 2026-04-17.
Download link:
https://ieee-dataport.org/documents/sinocr-and-sinfund
Description:
We present the SinOCR and SinFUND datasets, two comprehensive resources designed to advance Optical Character Recognition (OCR) and form understanding for the Sinhala language. SinOCR, the first publicly available and the most extensive dataset for Sinhala OCR to date, includes 100,000 images featuring printed text in 200 different Sinhala fonts and 1,135 images of handwritten text, capturing a wide spectrum of writing styles. SinFUND, the first fully annotated dataset of its kind, comprises 100 diverse, manually filled Sinhala forms, offering a robust foundation for developing template-free form understanding models. These datasets are crucial for addressing the challenges posed by paper-based documentation in low-resource languages, enhancing accuracy and efficiency in digital document processing. Both datasets aim to stimulate further research and innovation, providing valuable benchmarks for the OCR and form understanding communities. Access to these datasets will facilitate the development of more sophisticated models, promoting digital transformation and improved administrative processes in Sri Lanka and potentially other regions with similar linguistic challenges. The benchmarks will be published in a research article with the same title.
Provided by:
Pushpakumara, Supul; Hewagama, Danusha; Ambegoda, Thanuja; Gunathilaka, Kavishka



