
Handwritten Text Recognition Training and Test Set for German Kurrent of the 19th century

Source: Zenodo. Updated 2025-10-02. Indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.17252677
Description:
German Kurrent Handwritten Text Lines (19th Century) for HTR Model Training

This dataset comprises handwritten manuscripts in 19th-century German Kurrent, prepared for the training of a Handwritten Text Recognition (HTR) model. It contains a total of 9,317 text lines.

The data was sourced from the following repositories:

- Senatsprotokolle: https://github.com/ubtue/Ground-Truth/tree/main/Senatsprotokolle
- Digitale Schriftkunde (Bayerisches Hauptstaatsarchiv):
  - 1806 Reichsstadt Nürnberg, Ratsverläße
  - 1824 Maria Theresia, Privatkorrespondenz
  - 1894 Heroldenamt Akten
- Deutsches Textarchiv (DTA):
  - Libelt, Kosmos-Vorlesungen
  - Hufeland, Privatbesitz 1829
  - Erbkam, Tagebuch 1842
  - Auerbach, Sanders 1869
  - Auerbach, Sanders 1880
  - Auerbach, Sanders II 1880

For more details, see the README file of each dataset in the `data/pages/datasetname` folder.

Data Structure

The dataset is organized as follows:

- lines/
  - TestSet/
    - PNG files: line images
    - PageXML files: transcriptions
  - TrainingSet/
    - PNG files: line images
    - PageXML files: transcriptions
  - ValidationSet/
    - PNG files: line images
    - PageXML files: transcriptions
- pages/
  - DatasetName/
    - annotatedJpeg/: full-page images with baselines and text areas visible
    - pngAndXml/: page images with corresponding PageXML
    - README.md: dataset-specific metadata and description

Transcription Guidelines

Transcriptions were obtained from the original sources and adapted to follow the OCR-D Level 2 transcription guidelines to the best of the contributor’s knowledge and ability.

Disclaimer: I am not a professional linguist and do not read Kurrent fluently. Although care was taken to apply OCR-D Level 2 rules consistently, transcription errors or oversights cannot be fully excluded.

Line Detection

For all datasets except the Senatsprotokolle (which already contained line annotations), line detection was performed automatically using Transkribus, followed by manual correction. Each line was extracted with ascenders and descenders fully included in the text region, while minimizing overlap with adjacent lines.

Line Extraction

Line extraction was performed using a Python script, available here.

License

- Deutsches Textarchiv data: All content is released under CC BY 4.0.
- Bayerische Schriftkunde data:
  - Digital reproductions (images): CC0 / Public Domain Mark, per Staatliche Archive Bayerns terms.
  - Editorial content and transcriptions: CC BY-NC-SA 4.0
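
Example: reading one split of the line data

The description says each split folder under lines/ holds PNG line images and PageXML transcriptions. The following is a minimal sketch of how such a folder might be read, not the dataset author's script. It assumes that each PNG has a PageXML file with the same base name and a `.xml` extension, and that the transcription sits in a `Unicode` element, as is usual for PageXML. Neither assumption is confirmed by the description. The function names `local_name`, `read_line_text`, and `load_split` are hypothetical helpers introduced only for this illustration.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}Unicode' down to 'Unicode'."""
    return tag.rsplit("}", 1)[-1]


def read_line_text(xml_path: Path) -> str:
    """Return the text of the first non-empty Unicode element in a PageXML file."""
    root = ET.parse(xml_path).getroot()
    for element in root.iter():
        if local_name(element.tag) == "Unicode":
            text = (element.text or "").strip()
            if text:
                return text
    return ""


def load_split(split_dir: Path) -> list[tuple[Path, str]]:
    """Pair every PNG in split_dir with the transcription from its XML file."""
    samples = []
    for png_path in sorted(split_dir.glob("*.png")):
        # Assumption: image and transcription share the same base name.
        xml_path = png_path.with_suffix(".xml")
        if xml_path.exists():
            samples.append((png_path, read_line_text(xml_path)))
    return samples
```

Calling `load_split(Path("lines/TrainingSet"))` would then return a list of (image path, transcription) pairs for the training split. Images without a matching XML file are skipped. Matching on the local element name avoids hard-coding a PageXML schema version, since the description does not state which one is used.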
Provider: Zenodo
Created: 2025-10-02