DISTINGUO: Ground truth for Handwritten Text Recognition (HTR) on Collections of Distinctions (late 13th to late 15th century)
Source: DataCite Commons. Updated: 2025-06-29. Indexed: 2025-04-16.
Download link:
https://nakala.fr/10.34847/nkl.48ad8b8d
Description:
This dataset includes images of manuscripts containing collections of distinctions and their corresponding PAGE files with normalized transcription. The data was prepared for training kraken models during the DISTINGUO project (2019–2024). The project, led by Marjorie Burghart, was dedicated to the study of distinctiones in medieval Latin preaching.
The manuscript zone segmentation mostly follows the SegmOnto ontology (with one exception: tags such as `Main_col_1`, `Main_col_2`, etc., were used to differentiate columns).
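As an illustration of how the region tags might be used, here is a minimal sketch that groups transcribed lines by region label. It relies on the standard PAGE elements `TextRegion`, `TextLine`, `TextEquiv` and `Unicode`. It also assumes, without confirmation from the dataset itself, that the SegmOnto label is stored in the region's `custom` attribute in the form `structure {type:Main_col_1;}`; if the files store it elsewhere, the lookup needs adjusting. The function name `lines_by_region` and the file name in the usage note are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def lines_by_region(root):
    """Group line transcriptions by region label.

    `root` is the root element of a parsed PAGE file.
    Assumption: the label sits in the `custom` attribute as
    "structure {type:LABEL;}"; otherwise the region id is used.
    """
    grouped = {}
    # "{*}" matches any namespace, so the PAGE schema version does not matter
    # (requires Python 3.8 or later).
    for region in root.findall(".//{*}TextRegion"):
        match = re.search(r"type:([^;}]+)", region.get("custom", ""))
        label = match.group(1).strip() if match else region.get("id", "unknown")
        for line in region.findall("{*}TextLine"):
            # Only the line-level TextEquiv is read, not word-level ones.
            unicode_el = line.find("{*}TextEquiv/{*}Unicode")
            if unicode_el is not None and unicode_el.text:
                grouped.setdefault(label, []).append(unicode_el.text)
    return grouped
```

For a file on disk, one would call `lines_by_region(ET.parse("page.xml").getroot())` and then read, for example, the entry for `Main_col_1` to get the first column's lines in document order.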
The transcription is normalized.
Script families: textualis libraria, semitextualis libraria, semihybrida libraria, textualis currens, cursiva libraria, bastarda.
Number of hands: According to our estimates, each of the provided manuscript fragments was written by a single scribe.
Language: The texts are primarily written in Latin, with very rare insertions in Old French (no more than one phrase per folio).
Transcription guidelines:
- All abbreviations have been expanded, except for cases where biblical citations are reduced to a few letters of each word (e.g., "Redemisti me. do. de. ue." instead of "Redemisti me, Domine Deus ueritatis").
- Proper nouns and the word Deus have been capitalized.
- Both "v" and "u" have been transcribed as "u", except in Roman numerals.
- Punctuation has been partially normalized. Symbols such as `.`, `:`, and `/` have been transcribed as either `,` or `.` depending on the context.
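To bring external text into line with the u/v rule above (for instance before comparing it with this ground truth), something like the following sketch could be used. It is not the procedure used by the dataset's creator. It assumes that a token is a Roman numeral if it matches a conventional pattern extended to allow additive forms such as `iiii`, and that capital `V` becomes `U`. The other guidelines (abbreviation expansion, capitalization, context-dependent punctuation) require editorial judgment and are not attempted here.

```python
import re

# Assumption: Roman numerals follow the usual thousands/hundreds/tens/ones
# order; up to four repeats are allowed to cover additive medieval forms.
ROMAN_NUMERAL = re.compile(
    r"m{0,4}(cm|cd|d?c{0,4})(xc|xl|l?x{0,4})(ix|iv|v?i{0,4})",
    re.IGNORECASE,
)


def normalize_uv(text):
    """Replace v with u in every alphabetic token except Roman numerals."""

    def replace_word(match):
        word = match.group(0)
        if ROMAN_NUMERAL.fullmatch(word):
            return word  # leave numerals such as "xvi" untouched
        return word.replace("v", "u").replace("V", "U")

    return re.sub(r"[A-Za-z]+", replace_word, text)


print(normalize_uv("In psalmo xvi: vidi verbum"))
# prints "In psalmo xvi: uidi uerbum"
```

One known limitation: a token such as "vi" is ambiguous between the numeral six and the Latin word, and this sketch always treats it as a numeral.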
The creator of the dataset expresses her gratitude to Vassily Dolgopolov for his assistance in the study of the manuscripts.
Provider:
NAKALA - https://nakala.fr (Huma-Num - CNRS)
Created:
2024-12-30



