Handwritten Text Recognition Training and Test Set for German Kurrent of the 19th century
Zenodo | Updated: 2025-10-02 | Indexed: 2026-05-26
Download link:
https://zenodo.org/doi/10.5281/zenodo.17252677
Description:
German Kurrent Handwritten Text Lines (19th Century) for HTR Model Training
This dataset comprises handwritten manuscripts in 19th-century German Kurrent, prepared for the training of a Handwritten Text Recognition (HTR) model. It contains a total of 9,317 text lines. The data was sourced from the following repositories:
- Senatsprotokolle: https://github.com/ubtue/Ground-Truth/tree/main/Senatsprotokolle
- Digitale Schriftkunde (Bayerisches Hauptstaatsarchiv):
  - 1806 Reichsstadt Nürnberg, Ratsverläße
  - 1824 Maria Theresia, Privatkorrespondenz
  - 1894 Heroldenamt Akten
- Deutsches Textarchiv (DTA):
  - Libelt, Kosmos-Vorlesungen
  - Hufeland, Privatbesitz 1829
  - Erbkam, Tagebuch 1842
  - Auerbach, Sanders 1869
  - Auerbach, Sanders 1880
  - Auerbach, Sanders II 1880
For more details, see the README file of each dataset in the `data/pages/datasetname` folder.
Data Structure
The dataset is organized as follows:
lines/
  TestSet/
    PNG files: Line images
    PageXML files: Transcriptions
  TrainingSet/
    PNG files: Line images
    PageXML files: Transcriptions
  ValidationSet/
    PNG files: Line images
    PageXML files: Transcriptions
pages/
  DatasetName/
    annotatedJpeg/: full-page images with baselines and text areas visible
    pngAndXml/: page images with corresponding PageXML
    README.md: dataset-specific metadata and description
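The split folders can be walked with a few lines of Python. The sketch below is illustrative only: it assumes that each line image shares its file stem with its PageXML file (for example `0001.png` and `0001.xml`) and that the transcription files use the `.xml` extension. The description does not state the naming scheme, so check one folder before relying on it. The function name `pair_line_files` is hypothetical.

```python
from pathlib import Path


def pair_line_files(split_dir):
    """Return (png, xml) path pairs that share a file stem in split_dir.

    Assumption: image and transcription differ only in their extension.
    Images without a matching XML file are skipped.
    """
    split_dir = Path(split_dir)
    # Index the XML files by stem so each image lookup is a dict access.
    xml_by_stem = {p.stem: p for p in split_dir.glob("*.xml")}
    pairs = []
    for png in sorted(split_dir.glob("*.png")):
        xml = xml_by_stem.get(png.stem)
        if xml is not None:
            pairs.append((png, xml))
    return pairs
```

Calling `pair_line_files("lines/TrainingSet")` would then return the usable training samples. Comparing the number of pairs with the number of PNG files is a quick way to spot images that lack a transcription.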
Transcription Guidelines
Transcriptions were obtained from the original sources and adapted to follow the OCR-D Level 2 transcription guidelines to the best of the contributor’s knowledge and ability. Disclaimer: I am not a professional linguist and do not read Kurrent fluently. Although care was taken to apply OCR-D Level 2 rules consistently, transcription errors or oversights cannot be fully excluded.
Line Detection
For all datasets except the Senatsprotokolle (which already contained line annotations), line detection was performed automatically using Transkribus, followed by manual correction. Each line was extracted with ascenders and descenders fully included in the text region, while minimizing overlap with adjacent lines.
Line Extraction
Line extraction was performed using a Python script, available here.
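As a rough illustration of what such an extraction step involves, the sketch below reads the `TextLine` polygons from a PageXML document and reduces each one to a bounding box. It is not the author's script. It assumes the standard PAGE layout, in which a `TextLine` element has a `Coords` child whose `points` attribute holds `x,y` pairs separated by spaces. The function names `local_name` and `line_boxes` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, which differs between PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def line_boxes(page_xml_text):
    """Map each TextLine id to its bounding box (left, top, right, bottom)."""
    root = ET.fromstring(page_xml_text)
    boxes = {}
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only direct children count, so nested Word coordinates are ignored.
        coords = next((c for c in element if local_name(c.tag) == "Coords"), None)
        if coords is None or not coords.get("points"):
            continue
        points = [
            tuple(int(float(value)) for value in pair.split(","))
            for pair in coords.get("points").split()
        ]
        xs, ys = zip(*points)
        boxes[element.get("id")] = (min(xs), min(ys), max(xs), max(ys))
    return boxes
```

Each box has the `(left, upper, right, lower)` order that an image library such as Pillow expects in `Image.crop`. A plain bounding-box crop keeps ascenders and descenders, but it can also include parts of neighbouring lines. Masking the crop with the polygon itself is one way to reduce that overlap.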
License
Deutsches Textarchiv data: All content is released under CC BY 4.0.
Bayerische Schriftkunde data:
- Digital reproductions (images): CC0 / Public Domain Mark, per Staatliche Archive Bayerns terms.
- Editorial content and transcriptions: CC BY-NC-SA 4.0
Provider:
Zenodo
Created:
2025-10-02



