OCR Synthetic Benchmark Dataset for Indian Languages

Name: OCR Synthetic Benchmark Dataset for Indian Languages
Creator: EkStep Foundation, Tarento Technologies
Published: 2022-05-05 18:07:57
License: No description available

Source: arXiv. Updated 2022-05-05. Indexed 2024-08-06.

Download link:

http://arxiv.org/abs/2205.02543v1

Resource overview:

The OCR Synthetic Benchmark Dataset for Indian Languages was created by EkStep Foundation and Tarento Technologies. It contains 90,000 images, each paired with a ground-truth label, and covers 23 Indian languages. The images are produced by synthetic data generation and are intended to give optical character recognition (OCR) models diverse training data, with the aim of improving their accuracy and robustness on Indian-language text. The dataset targets computer vision and image processing work, particularly multilingual and complex document images, where it is intended to improve model performance and generalization.
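Because each image comes with a ground-truth label, OCR output can be scored against those labels. The listing does not say which metric is used, so the sketch below assumes character error rate (CER), a common choice for OCR evaluation. The function names `edit_distance` and `character_error_rate` are illustrative and do not come from the dataset or its authors.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """CER over (predicted_text, ground_truth_text) pairs.

    Assumption: CER = total edit distance / total ground-truth length.
    Python strings are compared per Unicode code point, so combining
    vowel signs in Indic scripts count as separate characters here.
    """
    total_edits = 0
    total_chars = 0
    for predicted, truth in pairs:
        total_edits += edit_distance(predicted, truth)
        total_chars += len(truth)
    if total_chars == 0:
        raise ValueError("ground-truth text is empty")
    return total_edits / total_chars
```

In use, `pairs` would hold the text an OCR model produced for each image alongside that image's label. How the labels are stored on disk is not described in this listing, so loading them is left out.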

Provided by:

EkStep Foundation, Tarento Technologies

Created:

2022-05-05