SDADDS-Guelma : A Multi-purpose Dataset for Synthetic Degraded Arabic Documents

Indexed from NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/10896123


Resource overview:

Description:

This is a partial release of the SDADDS-Guelma dataset. SDADDS-Guelma (Synthetic Degraded Arabic Document DataSet of the University of Guelma) is a database of synthetic noisy or degraded Arabic document images. It was created by Dr. Abderrahmane Kefali and his team to support research on preprocessing, analysis, and recognition of degraded Arabic documents, where having a large set of images for training and testing is essential. This dataset is made publicly available to researchers in the field of document analysis and recognition, with the hope that it will be useful and contribute to their research endeavors.

In this first release of the dataset, 84 handwritten images and 120 printed images have been used, along with 25 images of historical backgrounds, forming a total of 26316 synthetic images of degraded Arabic documents along with their corresponding ground-truth files. This release is separated into two parts to facilitate upload and use: one for the handwritten documents and the second for the printed documents.

Composition of the dataset:

Each of the parts of the SDADDS-Guelma dataset is organized into directories as follows:

TXT_Files: Contains texts in UTF-8 format.
IMG: Contains images of printed and handwritten Arabic text constructed from the text files.
Bin_IMG: Contains binary images corresponding to the original images.
BG_IMG: Contains images of empty old document backgrounds used for the generation of synthetic historical document images.
GT_Files: Contains XML annotation files corresponding to the text images.
Degraded_IMG: Contains synthetically generated degraded images, separated into sub-directories based on noise types such as Local_Noise, Show_through, Rotation, Curvature, Comb_IMG, etc.
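The directory layout above can be indexed with a short script. The following is only a sketch: it assumes that related files (for example an image in IMG and its annotation in GT_Files) share the same file stem, which the description does not state, and the helper names `index_by_stem` and `paired` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Top-level directory names as listed in the dataset description.
DATASET_DIRS = ["TXT_Files", "IMG", "Bin_IMG", "BG_IMG", "GT_Files", "Degraded_IMG"]


def index_by_stem(root):
    """Map each file stem to the top-level directories that contain it.

    Assumption (not confirmed by the description): related files in
    different directories share the same stem.
    """
    root = Path(root)
    index = defaultdict(dict)
    for name in DATASET_DIRS:
        folder = root / name
        if not folder.is_dir():
            continue  # a part of the release may lack a directory
        # rglob also descends into the noise-type sub-directories
        # of Degraded_IMG (Local_Noise, Rotation, ...).
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                index[path.stem].setdefault(name, []).append(path)
    return dict(index)


def paired(index, first="IMG", second="GT_Files"):
    """Return the sorted stems that appear in both directories."""
    return sorted(
        stem for stem, dirs in index.items() if first in dirs and second in dirs
    )
```

Calling `paired(index_by_stem("path/to/part"))` would list the documents that have both an original image and an annotation file. If the actual naming scheme differs, the stem-matching step is the part to adapt.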
Ground-truth information:

Ground-truth information is essential for a document dataset, as it annotates documents and represents their essential characteristics. Our dataset is designed to be a large-scale and multipurpose dataset. As such, our methodology ensures that ground truth is provided at three levels: the text level (character codes), the pixel level (binary and cleaned image), and the level of document physical structure and other annotation information.

Textual ground truth: identical to the original texts.
Pixel-level ground truth: presented in the form of binary images.
Document-structure ground truth: the structure of each document image, alongside the textual transcription of the words and Parts of Arabic Words (PAWs), is recorded in a corresponding XML annotation file.

The XML format used resembles that employed in similar works, with adjustments made for the specific characteristics of Arabic texts, including the presence of PAWs. Consequently, each original text image in our dataset is associated with an XML file detailing the entire ground truth and associated metadata.

Structure of XML file:

Each XML annotation file contains metadata about the document image and the text content within the image, including the language, number of lines, and font attributes. It also provides detailed information about each text line, word, and PAW, including their bounding boxes and textual transcriptions. Thus, each ground truth file takes the following form: .... .... .... ....

Contact:

Name: Dr. Abderrahmane Kefali
Affiliation: University of 8 May 1945-Guelma, Algeria
Email: kefali.abderrahmane@univ-guelma.dz
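Because the XML template is elided in this listing, the element and attribute names of the ground-truth files are not known here. A cautious first step is to inspect a file without assuming any schema. The sketch below uses only Python's standard `xml.etree.ElementTree`, and the function name `summarize_xml` is hypothetical. It reports the root tag, how often each element occurs, and which attribute names each element carries.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(source):
    """Summarize an XML file without assuming its schema.

    `source` is a file path or a file-like object. Returns the root tag,
    a Counter of element tags, and a dict mapping each tag to the set of
    attribute names seen on it.
    """
    root = ET.parse(source).getroot()
    tag_counts = Counter()
    attributes = {}
    for element in root.iter():  # walks the root and all descendants
        tag_counts[element.tag] += 1
        attributes.setdefault(element.tag, set()).update(element.attrib)
    return root.tag, tag_counts, attributes
```

Running this on one file from GT_Files should reveal which elements describe lines, words, and PAWs, and which attributes hold the bounding boxes and transcriptions, before any schema-specific parser is written.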

Created:

2025-01-09
