iarata/PHCR-DB25
Persian Historical Documents Handwritten Characters
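Dataset Description
Overview
This dataset contains preprocessed images of the contextual forms of Persian characters (excluding the letter گ), taken from 5 handwritten Persian historical books written in the Nastaliq script. It comprises 2775 images in 111 classes (25 images per class). The images are black-and-white TIFF files with a resolution of 72 dpi and a size of 395×395 pixels.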
Language
Persian
Dataset Structure
The dataset is structured as follows:
├── data
│   ├── 06a9_01.tif
│   ├── 06a9_02.tif
│   ├── 06a9_03.tif
│   ├── 06a9_04.tif
│   ├── 06a9_05.tif
│   ├── ...
│   ├── 06a9_25.tif
│   │
│   ├── 06cc_01.tif
│   ├── 06cc_02.tif
│   ├── 06cc_03.tif
│   ├── 06cc_04.tif
│   ├── 06cc_05.tif
│   ├── ...
│   ├── 06cc_25.tif
│   ├── ...
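Each image is named with the UTF-16 hexadecimal code of the character's contextual form, followed by the image number; the code can be converted back to text with any hex-to-string decoder. Within the numbering, every 5 consecutive images come from a different book. Each contextual form of a character is treated as a separate class, giving 111 classes in total.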
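The naming convention can be unpacked programmatically. The following is a minimal sketch, assuming that the part before the underscore is a run of 4-digit UTF-16 code units in big-endian order and that images 01 to 05 come from the first book, 06 to 10 from the second, and so on. The helper `parse_name` and the book numbering are illustrative and not part of the dataset.

```python
from __future__ import annotations

from pathlib import PurePath


def parse_name(filename: str) -> tuple[str, str, int, int]:
    """Split a name such as '06a9_01.tif' into (class_id, text, image_no, book_no)."""
    stem = PurePath(filename).stem            # e.g. '06a9_01'
    class_id, number = stem.rsplit("_", 1)    # e.g. '06a9' and '01'
    # Assumption: the hex string holds UTF-16 code units, high byte first.
    text = bytes.fromhex(class_id).decode("utf-16-be")
    image_no = int(number)
    # Assumption: images 1-5 are from book 1, 6-10 from book 2, and so on.
    book_no = (image_no - 1) // 5 + 1
    return class_id, text, image_no, book_no


class_id, text, image_no, book_no = parse_name("06a9_07.tif")
print(class_id, f"U+{ord(text[0]):04X}", image_no, book_no)  # 06a9 U+06A9 7 2
```

Grouping files by the returned `class_id` gives one group per class, and grouping by `book_no` allows a split in which whole books are held out, provided the numbering assumption holds.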
Dataset Creation
Source Data
The data comes from 5 historical Persian books held by the US Library of Congress.
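The image preprocessing steps are:
- Normalize the image to reduce noise in the character background.
- Convert the normalized image to a single-channel grayscale image.
- Apply image thresholding to the grayscale image to remove the character background.
- Binarize the thresholded image: pixels with a value greater than 0 become 255 (white), and pixels with a value of 0 (black) are left unchanged.
- Finally, invert the binarized image.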
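The steps above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the source does not name the normalization method, the grayscale weights, the threshold value, or the threshold direction, so the min-max stretch, the BT.601 luma weights, the default of 128, and the choice to zero out bright pixels are all guesses, and `preprocess` is a hypothetical helper.

```python
import numpy as np


def preprocess(rgb: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Return a two-level uint8 image from an H x W x 3 uint8 character crop."""
    img = rgb.astype(np.float64)
    # 1. Normalization (assumed: min-max stretch to the 0-255 range).
    lo, hi = img.min(), img.max()
    if hi > lo:
        img = (img - lo) / (hi - lo) * 255.0
    # 2. Single-channel grayscale (assumed: ITU-R BT.601 luma weights).
    gray = img @ np.array([0.299, 0.587, 0.114])
    # 3. Thresholding (assumed direction): pixels brighter than `threshold`
    #    are treated as background and set to 0; darker pixels are kept.
    kept = np.where(gray > threshold, 0.0, gray)
    # 4. Binarization: values greater than 0 become 255, zeros stay 0.
    #    A character pixel that is already exactly 0 here cannot be told
    #    apart from removed background.
    binary = np.where(kept > 0, 255, 0).astype(np.uint8)
    # 5. Inversion: kept pixels end up black (0), the background white (255).
    return 255 - binary
```

Under these assumptions a dark stroke on light paper comes out as black on white, which matches the black-and-white images described in the overview; with the opposite threshold direction the polarity would be reversed.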
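Annotations
Before preprocessing, each character was cropped from the books and saved under its UTF-16 hexadecimal code followed by the image number (for example, 06a9_01.tif).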
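For illustration, a file name of this shape can be produced from a character string and an image number. The sketch below assumes lowercase big-endian UTF-16 hex and a two-digit, zero-padded image number, as in the example name; `label_name` is a hypothetical helper, not a tool used by the dataset authors.

```python
def label_name(form: str, image_no: int) -> str:
    """Build a name like '06a9_01.tif' from a contextual form and its image number."""
    # Assumption: the code is the lowercase hex of the UTF-16 (big-endian) bytes.
    code = form.encode("utf-16-be").hex()
    return f"{code}_{image_no:02d}.tif"


print(label_name("\u06a9", 1))  # 06a9_01.tif
```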
Annotators
Citation Information
Hajebrahimi, A., Santoso, M.E., Kovacs, M., Kryssanov, V.V. (2024). Few-Shot Learning for Character Recognition in Persian Historical Documents. In: Nicosia, G., Ojha, V., La Malfa, E., La Malfa, G., Pardalos, P.M., Umeton, R. (eds) Machine Learning, Optimization, and Data Science. LOD 2023. Lecture Notes in Computer Science, vol 14505. Springer, Cham. https://doi.org/10.1007/978-3-031-53969-5_20
BibTeX:
@InProceedings{10.1007/978-3-031-53969-5_20,
  author="Hajebrahimi, Alireza and Santoso, Michael Evan and Kovacs, Mate and Kryssanov, Victor V.",
  editor="Nicosia, Giuseppe and Ojha, Varun and La Malfa, Emanuele and La Malfa, Gabriele and Pardalos, Panos M. and Umeton, Renato",
  title="Few-Shot Learning for Character Recognition in Persian Historical Documents",
  booktitle="Machine Learning, Optimization, and Data Science",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="259--273",
  abstract="Digitizing historical documents is crucial for the preservation of cultural heritage. The digitization of documents written in Perso-Arabic scripts, however, presents multiple challenges. The Nastaliq calligraphy can be difficult to read even for a native speaker, and the four contextual forms of alphabet letters pose a complex task to current optical character recognition systems. To address these challenges, the presented study develops an approach for character recognition in Persian historical documents using few-shot learning with Siamese Neural Networks. A small, novel dataset is created from Persian historical documents for training and testing purposes. Experiments on the dataset resulted in a 94.75{\%} testing accuracy for the few-shot learning task, and a 67{\%} character recognition accuracy was observed on unseen documents for 111 distinct character classes.",
  isbn="978-3-031-53969-5"
}



