Dataset and evaluation for HTR models for Latin and French Medieval Documentary Manuscripts

NIAID Data Ecosystem, indexed 2026-03-14

Download link:

https://zenodo.org/record/7401832

Description:

1. Dataset presentation.

This is the dataset used to produce the HTR models for documentary Latin and French manuscripts presented in the paper: Sergio Torres Aguilar, Vincent Jolivet. Handwritten Text Recognition for Documentary Medieval Manuscripts. 2022. https://hal.science/hal-03892163

The dataset contains mostly charters and registers from the late medieval period (12th-15th centuries). Training and evaluation, covering 1855 pages, 120k lines of text and almost 1M tokens, were conducted using three freely available ground-truth corpora:

The Alcar-HOME database: https://zenodo.org/record/5600884
The e-NDP corpus: https://github.com/chartes/e-NDP_HTR
The Himanis project: https://zenodo.org/record/5535306

The final model operates in a multilingual environment (Latin and French) and can recognize several Latin script families (mostly Textualis and Cursiva) in documents produced ca. 12th-15th centuries. During evaluation, the model shows an accuracy of 94.01% on the validation set and a CER (character error rate) of about 0.12 to 0.17 on four external unseen datasets. Fine-tuning on 10 ground-truth pages can improve these results to a CER between 0.06 and 0.10.

2. Dataset contents.

a) GT_list: List of the GT file names that constitute the training, evaluation and test sets. The images and transcriptions can be downloaded from their original repositories.
b) Training: Contains the training and testing results (evaluation and prediction files) presented in the original paper for the two training phases: Regular (separate training for Textualis and Cursiva) and Quartiles (mixed training by quartiles).
c) Useful_scripts: Scripts to produce the HTR metrics (CER, WER, SER) and plot the model's accuracy.
d) Best_model: Contains the best multilingual and multi-script model.
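For readers unfamiliar with the metrics named above, the following is a minimal sketch of how CER and WER are commonly computed from the Levenshtein edit distance. It is not the code shipped in Useful_scripts; the function names `edit_distance` and `error_rate` and the sample lines are illustrative assumptions, and the dataset's own scripts may normalize text or aggregate results differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """Total edits divided by total reference length, over paired lines.

    unit="char" gives a CER, unit="word" gives a WER (whitespace tokens).
    This pooled definition is an assumption, not taken from the dataset.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = list(ref) if unit == "char" else ref.split()
        h = list(hyp) if unit == "char" else hyp.split()
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return total_edits / total_len if total_len else 0.0


if __name__ == "__main__":
    # Hypothetical ground-truth lines and model predictions.
    refs = ["in nomine domini", "amen"]
    hyps = ["in nomine dornini", "amen"]
    print("CER:", error_rate(refs, hyps, unit="char"))
    print("WER:", error_rate(refs, hyps, unit="word"))
```

Under this definition, a CER of 0.12 corresponds to roughly 12 character edits per 100 reference characters.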

Created:

2023-01-18
