Dataset and evaluation for HTR models for Latin and French Medieval Documentary Manuscripts
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7401832
Description:
1. Dataset presentation.
This is the dataset used to produce the HTR models applied to documentary Latin and French manuscripts presented in the paper: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. 2022. https://hal.science/hal-03892163
The dataset contains mostly charters and registers from the late medieval period (12th-15th centuries). Training and evaluation, covering 1855 pages, 120k lines of text and almost 1M tokens, were conducted using three freely available ground-truth corpora:
The Alcar-HOME database: https://zenodo.org/record/5600884
The e-NDP corpus: https://github.com/chartes/e-NDP_HTR
The Himanis project: https://zenodo.org/record/5535306
The final model operates in a multilingual setting (Latin and French) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced between ca. the 12th and 15th centuries. During evaluation, the model shows an accuracy of 94.01% on the validation set and a CER (character error ratio) of about 0.12 to 0.17 on four external unseen datasets. Fine-tuning on 10 ground-truth pages can improve these results to a CER between 0.06 and 0.10.
2. Dataset contents.
a) GT_list: List of the GT (ground-truth) file names that make up the training, evaluation and test sets. The images and transcriptions can be downloaded from their original repositories.
b) Training: Contains the training and testing results (evaluation and prediction files) presented in the original paper for the two training phases: Regular (separate training for Textualis and Cursiva) and Quartiles (mixed training by quartiles).
c) Useful_scripts: Scripts to produce the HTR metrics (CER, WER, SER) and plot the model's accuracy.
d) Best_model: Contains the best multilingual and multi-script model.
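For readers unfamiliar with the metrics named in item c), the following is a minimal sketch of how CER, WER and SER are commonly computed from paired reference and predicted lines. It is not the code shipped in Useful_scripts: the function names edit_distance and htr_metrics are hypothetical, SER is assumed here to mean sentence (line) error rate, and the scores are assumed to be aggregated over the whole corpus rather than averaged per line.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def htr_metrics(refs: List[str], hyps: List[str]) -> Dict[str, float]:
    """Corpus-level CER, WER and SER for paired reference/hypothesis lines.

    Assumes refs is non-empty and contains at least one character and one word.
    """
    pairs = list(zip(refs, hyps))
    char_errors = sum(edit_distance(r, h) for r, h in pairs)
    word_errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    line_errors = sum(r != h for r, h in pairs)
    return {
        "CER": char_errors / sum(len(r) for r in refs),
        "WER": word_errors / sum(len(r.split()) for r in refs),
        "SER": line_errors / len(refs),  # assumed: share of lines with any error
    }


if __name__ == "__main__":
    # Toy example (invented lines, not taken from the dataset).
    references = ["in nomine domini", "amen"]
    predictions = ["in nomine domni", "amen"]
    print(htr_metrics(references, predictions))
    # prints {'CER': 0.05, 'WER': 0.25, 'SER': 0.5}
```

Under these assumptions a CER of 0.12 corresponds to roughly 12 character edits per 100 reference characters; the original scripts may normalize text or aggregate scores differently.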
Created:
2023-01-18



