Tamil Palm Leaf Character Dataset: Single and Multi-Segmented Characters Across 20 Different Centuries.
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/b7vhz7z83k
Description:
This dataset consists of Tamil palm leaf manuscript characters collected from 20 different centuries. It is divided into single-segmented characters (individual Tamil letters) and multi-segmented characters (complex characters composed of multiple strokes). The single-segmented characters are labelled 1 to 97. Labels 6 and 20 are the most frequent; both characters appear across different centuries with slight variations. Label 89 occurs rarely. The multi-segmented characters are labelled 98 to 1045; most of them appear only once, and all are composed entirely of multiple strokes.
Each label in the dataset represents different pixel properties extracted from the corresponding images. With 1045 labels, the dataset captures a diverse range of character variations. The images are 40 × 32 pixels in size, and the pixel properties are averaged to provide meaningful feature representations.
The dataset is designed to aid research in historical manuscript recognition, optical character recognition (OCR), and Tamil script analysis. It includes high-resolution binarized images along with annotations that provide segmentation information. The images are pre-processed to enhance clarity, making them suitable for deep learning applications in character recognition. The dataset is unbalanced, so augmentation and resampling are needed to achieve better accuracy. The folders are distributed as a ZIP archive.
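The label ranges and the class imbalance described above can be illustrated with a short Python sketch. It is not part of the dataset or its tooling: the helper names (`label_kind`, `class_counts`, `oversample`) are hypothetical, samples are assumed to be `(item, label)` pairs with integer labels, and random oversampling is only one simple way to rebalance classes before training.

```python
import random
from collections import Counter

# Label ranges as stated in the dataset description.
SINGLE_RANGE = range(1, 98)     # labels 1..97: single-segmented characters
MULTI_RANGE = range(98, 1046)   # labels 98..1045: multi-segmented characters


def label_kind(label):
    """Return 'single' or 'multi' for a dataset label (hypothetical helper)."""
    if label in SINGLE_RANGE:
        return "single"
    if label in MULTI_RANGE:
        return "multi"
    raise ValueError(f"label {label} is outside 1..1045")


def class_counts(samples):
    """Count how many samples each label has."""
    return Counter(label for _, label in samples)


def oversample(samples, seed=0):
    """Randomly duplicate items of minority classes until every class
    has as many samples as the largest one."""
    if not samples:
        return []
    rng = random.Random(seed)
    by_label = {}
    for item, label in samples:
        by_label.setdefault(label, []).append(item)
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((item, label) for item in items + extra)
    rng.shuffle(balanced)
    return balanced
```

Since most multi-segmented labels appear only once, plain duplication adds no new variation for those classes. In practice it would usually be combined with image augmentation such as small shifts or rotations.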
Files & Data Organization:
• Images: High-resolution palm leaf manuscript images (PNG format).
• Annotations: Manually annotated ground-truth text and segmentation details for the image files.
• Single Characters: Segmented images of individual Tamil characters.
• Multi Segmented Characters: Segmented images of composite Tamil characters.
Methodology:
• Data were sourced from Tamil palm leaf manuscripts from the Tamil Virtual Academy (TVA).
• Preprocessing steps include noise reduction and binarization.
• Characters are semantically segmented and categorized based on the ground truth.
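The preprocessing steps listed above (noise reduction followed by binarization) could look roughly like the following sketch. The description does not say which algorithms were actually used, so the 3 × 3 median filter and Otsu's threshold here are assumptions chosen for illustration, and the function names are hypothetical. Images are assumed to be 8-bit grayscale, stored as lists of rows of integers in 0..255.

```python
from statistics import median


def median_filter3(img):
    """3x3 median filter for noise reduction (assumed method).
    Border pixels use the part of the window that lies inside the image."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = int(median(window))
    return out


def otsu_threshold(img):
    """Return the threshold t that maximizes the between-class variance
    of the two groups 'pixels <= t' and 'pixels > t' (Otsu's method)."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_low, sum_low = 0, 0
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue          # no pixels at or below t yet
        w_high = total - w_low
        if w_high == 0:
            break             # all pixels are at or below t
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(img, t=None):
    """Map pixels above the threshold to 1 and the rest to 0.
    Invert the result if dark ink should be the foreground."""
    if t is None:
        t = otsu_threshold(img)
    return [[1 if v > t else 0 for v in row] for row in img]
```

A typical call would be `binarize(median_filter3(gray))`, where `gray` is a grayscale image loaded by whatever image library is in use.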
Ethical Considerations:
This dataset contains historical materials and should be used with proper attribution. It is shared for research purposes only, and modifications that misrepresent its content are not permitted.
Funding & Acknowledgments:
This dataset was developed by SASTRA Deemed University with the support of the Tamil Virtual Academy (TVA), which provided access to the manuscript images.
Created:
2025-03-24



