iarata/PHCR-DB25

Name: iarata/PHCR-DB25
Creator: iarata
Published: 2024-02-20 11:27:00
License: No description available

Source: Hugging Face. Updated 2024-02-20. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/iarata/PHCR-DB25

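Summary:

This dataset contains preprocessed images of the contextual forms of Persian characters (excluding the letter گ), taken from 5 handwritten Persian historical books. It totals 2775 images across 111 classes. The images are black-and-white TIFF files at 72 dpi, 395×395 pixels. The dataset structure is documented through its folder layout and file naming; each file name is built from the character's UTF-16 hex code and an image number. Dataset creation involved image normalization, grayscale conversion, thresholding, and binarization. The source material is 5 Persian historical books from the Library of Congress. The annotation notes describe how the characters were cropped and saved, along with information about the annotators.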

Provided by:

iarata

Summary of Original Information

Persian Historical Documents Handwritten Characters

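Dataset Description

Overview

This dataset contains preprocessed images of the contextual forms of Persian characters (excluding the letter گ), taken from 5 handwritten Persian historical books written in the Nastaliq script. It contains 2775 images in 111 classes. The images are black-and-white TIFF files at 72 dpi, 395×395 pixels.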

Language

Persian

Dataset Structure

The dataset is structured as follows:

├── data
│   ├── 06a9_01.tif
│   ├── 06a9_02.tif
│   ├── 06a9_03.tif
│   ├── 06a9_04.tif
│   ├── 06a9_05.tif
│   ├── ...
│   ├── 06a9_25.tif
│   │
│   ├── 06cc_01.tif
│   ├── 06cc_02.tif
│   ├── 06cc_03.tif
│   ├── 06cc_04.tif
│   ├── 06cc_05.tif
│   ├── ...
│   ├── 06cc_25.tif
│   ├── ...

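Each image name consists of the UTF-16 hex code of the character's contextual form (decodable with a hex-to-string decoder), followed by an image number. Within the numbering, each run of 5 images comes from a different book, so the 25 images per class cover all 5 books. Each contextual form of a character is treated as a separate class, giving 111 classes (111 × 25 = 2775 images).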
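The naming rule can be illustrated with a small Python sketch. The helper names `parse_name` and `group_by_class` are hypothetical and not part of the dataset. The mapping of image numbers 1-5 to book 1, 6-10 to book 2, and so on is an assumption based on the "every 5 images" rule; the source does not say which book each index refers to.

```python
import re
from collections import defaultdict

# One or more 4-digit UTF-16 code units, an underscore, an image number, ".tif".
NAME_RE = re.compile(r"^((?:[0-9a-fA-F]{4})+)_(\d+)\.tif$")

def parse_name(filename):
    """Split a name such as '06a9_01.tif' into (hex code, character, number, book)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    code_hex = match.group(1).lower()
    number = int(match.group(2))
    # Decode the hex digits as big-endian UTF-16 code units.
    char = bytes.fromhex(code_hex).decode("utf-16-be")
    # Assumption: numbers 1-5 -> book 1, 6-10 -> book 2, ..., 21-25 -> book 5.
    book = (number - 1) // 5 + 1
    return code_hex, char, number, book

def group_by_class(filenames):
    """Map each hex code (one class) to the sorted image numbers found for it."""
    classes = defaultdict(list)
    for name in filenames:
        code_hex, _, number, _ = parse_name(name)
        classes[code_hex].append(number)
    return {code: sorted(nums) for code, nums in classes.items()}
```

For example, `parse_name("06a9_01.tif")` yields the code `06a9`, the character U+06A9 (ک), image number 1, and book index 1 under the stated assumption. On a complete copy of the dataset, `group_by_class` would be expected to return 111 keys with 25 numbers each.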

Dataset Creation

Source Data

The data come from 5 historical Persian books held by the Library of Congress.

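The image preprocessing steps are:

Normalize the image to reduce noise in the character background.
Convert the normalized image to a single-channel grayscale image.
Apply image thresholding to the grayscale image to remove the character background.
Binarize the thresholded image: pixels with a value greater than 0 become 255 (white), and pixels with a value of 0 (black) are left unchanged.
Finally, invert the binarized image.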
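The five steps above can be sketched with NumPy as follows. This is only an illustration under stated assumptions, not the authors' code: the source does not name the normalization method, the threshold type, or the threshold value. The sketch assumes min-max normalization, dark ink on lighter paper, and a hypothetical threshold of 128. The function name `preprocess` is also hypothetical.

```python
import numpy as np

def preprocess(rgb, threshold=128):
    """Apply the five listed steps to an H x W x 3 uint8 array.

    Assumptions (not stated in the source): min-max normalization,
    dark ink on lighter paper, and a fixed threshold of 128.
    """
    img = rgb.astype(np.float64)

    # 1. Normalization: stretch intensities to the full [0, 255] range.
    lo, hi = img.min(), img.max()
    if hi > lo:
        img = (img - lo) / (hi - lo) * 255.0

    # 2. Grayscale: weighted channel sum (ITU-R BT.601 luma weights).
    gray = img @ np.array([0.299, 0.587, 0.114])

    # 3. Thresholding: pixels darker than the threshold are kept as
    #    non-zero foreground; the lighter background is set to 0.
    foreground = np.where(gray < threshold, 1, 0)

    # 4. Binarization: values greater than 0 become 255, zeros stay 0.
    binary = np.where(foreground > 0, 255, 0).astype(np.uint8)

    # 5. Inversion: ink ends up black (0) on a white (255) background.
    return 255 - binary
```

With these choices the output contains only the values 0 and 255, which matches the black-and-white images described in the overview. A different threshold type or an adaptive threshold would change step 3 only.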

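Annotations

Before preprocessing, the characters were cropped from the books and saved under a name made of the UTF-16 hex code plus an image number (for example, 06a9_01.tif).

Annotators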

Citation Information

Hajebrahimi, A., Santoso, M.E., Kovacs, M., Kryssanov, V.V. (2024). Few-Shot Learning for Character Recognition in Persian Historical Documents. In: Nicosia, G., Ojha, V., La Malfa, E., La Malfa, G., Pardalos, P.M., Umeton, R. (eds) Machine Learning, Optimization, and Data Science. LOD 2023. Lecture Notes in Computer Science, vol 14505. Springer, Cham. https://doi.org/10.1007/978-3-031-53969-5_20

BibTeX:

@InProceedings{10.1007/978-3-031-53969-5_20,
  author="Hajebrahimi, Alireza and Santoso, Michael Evan and Kovacs, Mate and Kryssanov, Victor V.",
  editor="Nicosia, Giuseppe and Ojha, Varun and La Malfa, Emanuele and La Malfa, Gabriele and Pardalos, Panos M. and Umeton, Renato",
  title="Few-Shot Learning for Character Recognition in Persian Historical Documents",
  booktitle="Machine Learning, Optimization, and Data Science",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="259--273",
  abstract="Digitizing historical documents is crucial for the preservation of cultural heritage. The digitization of documents written in Perso-Arabic scripts, however, presents multiple challenges. The Nastaliq calligraphy can be difficult to read even for a native speaker, and the four contextual forms of alphabet letters pose a complex task to current optical character recognition systems. To address these challenges, the presented study develops an approach for character recognition in Persian historical documents using few-shot learning with Siamese Neural Networks. A small, novel dataset is created from Persian historical documents for training and testing purposes. Experiments on the dataset resulted in a 94.75{\%} testing accuracy for the few-shot learning task, and a 67{\%} character recognition accuracy was observed on unseen documents for 111 distinct character classes.",
  isbn="978-3-031-53969-5"
}
