270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics

Name: 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics
Creator: Faculty of Computer Science and Artificial Intelligence, Helwan University, Helwan, Egypt
Published: 2022-08-27 05:02:07
License: No description available

arXiv, updated 2022-08-27, indexed 2024-08-06

Download link:

http://arxiv.org/abs/2208.11484v2

Description:

This study presents what it describes as the largest Arabic OCR dataset to date: 30.5 million images (text lines) and 270 million words across many fonts and writing styles. Besides textual ground truth, the dataset pays particular attention to Arabic diacritics. It is built from images collected from the web and handwritten samples from the KHATT database, so that models can be trained on all types of sequences and sentence positions, which in turn supports the segmentation stage for historical printed and handwritten texts. The main application is full-text accessibility of historical Arabic documents, with the aim of substantially lowering the cost of converting scanned page images into searchable full text.
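To give a feel for the figures in the description, the following is a minimal sketch rather than anything taken from the dataset's own tooling. It computes the average number of words per text-line image from the two stated totals, and shows one common way to derive an undiacritized string from diacritized ground truth. It assumes the ground truth is available as plain Unicode strings; the dataset's actual file layout is not described here, and the function names are hypothetical.

```python
import unicodedata

# Totals as stated in the dataset description.
TEXT_LINE_IMAGES = 30_500_000
WORDS = 270_000_000


def average_words_per_line(words: int, lines: int) -> float:
    """Mean number of words per text-line image."""
    return words / lines


def strip_diacritics(text: str) -> str:
    """Drop Unicode nonspacing marks (category Mn).

    Arabic short-vowel marks such as fatha, damma, kasra, and sukun
    fall in this category, so this yields the bare letter sequence.
    Assumption: the input is a plain Unicode string.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    print(round(average_words_per_line(WORDS, TEXT_LINE_IMAGES), 2))
    # Kaf, ta, ba, each followed by a fatha (U+064E).
    diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(strip_diacritics(diacritized))
```

A pair of diacritized and stripped strings like this could be used to evaluate an OCR model with and without diacritics, though whether the authors report results that way is not stated in this description.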

Provided by:

Faculty of Computer Science and Artificial Intelligence, Helwan University, Helwan, Egypt

Created:

2022-08-21
