ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7863943
Description:
Disclaimer: This version was accepted as a full paper at ICDAR 2023.
Optical Character Recognition (OCR) is a technology that enables machines to read and interpret printed or handwritten text from scanned images or photographs. However, the accuracy of OCR systems varies with several factors, such as the quality of the input image, the font used, and the language of the document. As a general tendency, OCR algorithms perform better on resource-rich languages, which have more annotated data for training recognition models. Although Portuguese is one of the most widely spoken languages, OCR for Portuguese remains largely unexplored. In this work, we propose ESTER-Pt, an Evaluation Suite for TExt Recognition in Portuguese. The suite comprises four types of resources: synthetic text-based documents, synthetic image-based documents, real scanned documents, and a hybrid set of real image-based documents that were synthetically degraded.
Created:
2023-04-28



