
Towards a general open dataset and model for late medieval Castilian text recognition (HTR/OCR). Datasets and scripts

Indexed from the NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/7386489
Description:
This repository contains the dataset of the article "Towards a general open dataset and models for late medieval Castilian writing (HTR/OCR)", submitted to the Journal of Data Mining and Digital Humanities (JDMDH). I refer to the paper (https://doi.org/10.5281/zenodo.7387376) for the description of the corpus and the models.

The dataset is in version V2: it contains the allographetic AND graphematic transcriptions (files `*.normalized.xml`) and models. Caveat: only the allographetic transcriptions and models are described in the data paper mentioned above. The graphematic transcriptions are produced using a Chocomuffin conversion table (see `corpus/conversion_table.csv`) to reduce each allograph to its corresponding grapheme. The abbreviations are not expanded.

Please cite the following paper if you use this dataset or the models:

@article{gille_levenson_2023_towards,
  author = {Gille Levenson, Matthias},
  date = {2023},
  journaltitle = {Journal of Data Mining and Digital Humanities},
  doi = {10.46298/jdmdh.10416},
  editor = {Pinche, Ariane and Stokes, Peter},
  issuetitle = {Special Issue: Historical documents and automatic text recognition},
  title = {Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR)},
}

GILLE LEVENSON, Matthias, « Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR) », Journal of Data Mining and Digital Humanities (2023): Special Issue: Historical documents and automatic text recognition, eds. Ariane PINCHE and Peter STOKES, DOI: 10.46298/jdmdh.10416.

The image of manuscript M (Esc_M) has not yet been uploaded, pending permission from the library that keeps the manuscript. All images are kept in a directory named after the place where the manuscript is kept and, for the in-domain dataset, the siglum of the witness.

The global licence for the dataset (except for images) is CC-BY-NC-SA.
All manuscript reproductions are published with the authorization of the libraries.

© Biblioteca General Histórica de Salamanca
  Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2709 (L)
  Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2097 (J)
  Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2673
  Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2011
  Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2654
  Universidad de Salamanca (España), Biblioteca General Histórica, Ms. 2086

© Museo Lázaro Galdiano. Madrid
  Inv. 15304, Fundación Lázaro Galdiano (A)

© Universidad de Valladolid
  Ms. 251, Biblioteca Santa Cruz (S)

© Real Biblioteca del Escorial
  Ms. K.I.5, Biblioteca del Real Monasterio del Escorial (Q)
  Ms. h.I.8, Biblioteca del Real Monasterio del Escorial (M): to be published
  Ms. Z-I-12
  Ms. Z-III-9
  Ms. X-III-4
  Ms. h-III-9
  Ms. b-IV-15
  Ms. b-II-11
  Ms. a-II-17
  Ms. T-III-5

© Rosenbach Foundation
  Ms. 482/2 (U)

© Gallica.bnf.fr
  Espagnol 12
  Espagnol 36
  Espagnol 218

© Bodleian Library
  Ms. Span. d. 1
  Ms. Span. d. 2/1

© Biblioteca Real, Madrid
  Ms. II/215 (G)

© Biblioteca Nacional de España
  Mss/4183
  Inc/901 (Z)

© Biblioteca Universitaria, Sevilla
  Ms. 332/131 (R)

Edit: add result files
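The allograph-to-grapheme reduction mentioned in the description can be illustrated with a minimal Python sketch. This is not the Chocomuffin tool itself, and the layout of `corpus/conversion_table.csv` is not documented here: the sketch assumes a headerless two-column CSV (allograph, grapheme). The sample rows and the helper names `load_table` and `to_graphematic` are hypothetical.

```python
import csv
import io
import re

# Hypothetical sample rows. The real layout of corpus/conversion_table.csv
# is an assumption: two columns, allograph then grapheme, no header row.
SAMPLE_TABLE = "ſ,s\nꝛ,r\n"


def load_table(handle):
    """Read allograph -> grapheme pairs from a two-column CSV handle."""
    table = {}
    for row in csv.reader(handle):
        # Skip blank or incomplete rows.
        if len(row) >= 2 and row[0]:
            table[row[0]] = row[1]
    return table


def to_graphematic(text, table):
    """Replace each allograph by its grapheme in a single pass."""
    if not table:
        return text
    # Longest keys first, so multi-character allographs win over their prefixes.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)


table = load_table(io.StringIO(SAMPLE_TABLE))
print(to_graphematic("ſeñoꝛ", table))  # prints: señor
```

To try this on the actual table, the `io.StringIO` handle could be swapped for `open("corpus/conversion_table.csv", encoding="utf-8", newline="")`, provided the file really follows the assumed layout. The single-pass regular expression avoids chained replacements, where the output of one rule would be rewritten by another.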
Created: 2023-10-16