Automatic transcriptions of distinctiones collections produced in the course of the Distinguo project

Name: Automatic transcriptions of distinctiones collections produced in the course of the Distinguo project
Creator: NAKALA - https://nakala.fr (Huma-Num - CNRS)
Published: 2025-07-06 20:39:18
License: No description available

DataCite Commons. Updated: 2025-07-06. Indexed: 2025-04-16.

Download link:

https://nakala.fr/10.34847/nkl.d9c14mcs

Description:

This dataset contains automatic transcriptions of manuscripts and early printed books featuring collections of distinctiones, specifically: "Dictionarium bovis" by Thomas of Pavia, "Distinctiones" by Vincent Ferrer, "Summa" by Guy of Evreux, and "Alphabetum in artem sermocinandi" by Peter of Capua. The transcriptions were prepared during the DISTINGUO project (2019–2024), which focused on the study of distinctiones in medieval Latin preaching and was led by Marjorie Burghart.

"Dictionarium bovis" was transcribed based on the manuscripts held at Florence, Biblioteca Medicea Laurenziana, Mss S. Crucis, Plut. XXVIII sin., cod. 2–6. It was segmented into regions using a model developed during the DISTINGUO project for analyzing double-column manuscript layouts. This model, inspired by the SegmOnto ontology, was trained to recognize Main_col_1, Main_col_2, and DropCapitalZone. For line segmentation, a model by Thibault Clérice, fine-tuned on 300 pages from various "Dictionarium bovis" manuscripts, was used. Due to the manuscript's non-linear reading order, a dedicated Kraken model was trained on 90 pages of cod. 2 to determine the correct sequence of lines. No manual correction was applied after model-based segmentation. The transcriptions were generated using the Tridis_v2 model, fine-tuned on the first 60 pages of cod. 2. Ground truth for this training is included in the dataset: https://nakala.fr/10.34847/nkl.48ad8b8d.

"Distinctiones" by Vincent Ferrer was transcribed based on the 1583 Lyon edition "Distinctiones Beati Vincentii Divini verbi praeconis". Both region and line segmentation were performed using a Kraken model trained on the first 197 pages of the edition. Transcription was done via Transkribus and underwent light manual correction.

"Summa" by Guy of Evreux was transcribed from the manuscript Mons, Bibliothèque publique, 1/103. The first 109 folios were segmented using a region segmentation model developed as part of the Passim project. The resulting annotations were corrected and used to train a custom YOLOv8 model capable of detecting Main_col_1, Main_col_2, Main_col_3, Main_col_4, DropCapitalZone, MarginTextZone, and QuireMarkZone. Line segmentation was performed using Thibault Clérice's model. Lines containing rubrics (e.g., In Nativitate Domini, In vigilia Nativitatis Domini) were labeled as HeadingLine:rubric, while all others were marked as DefaultLine. The transcription was produced using the Tridis_v2 model, fine-tuned on the first 9 folios of this manuscript. Training data was generated by aligning this manuscript with a transcription made by Marjorie Burghart of Paris, BnF, Latin 15966. The resulting transcription was not manually corrected.

"Alphabetum" by Peter of Capua was segmented into regions from the manuscript Praha, Knihovna Národního muzea, XV A 3 (which contains the text up to the letter S). The first 190 pages were segmented using the Passim project's region segmentation model, manually corrected, and used to train a YOLO-based model limited to two types of Main_col.
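As a rough illustration of how the zone labels mentioned above (Main_col_1, Main_col_2, and so on) can be used downstream, the sketch below assigns each detected line to the column region that contains it and reads the columns in label order, top to bottom. This is a naive baseline under stated assumptions, not the project's method: the description notes that "Dictionarium bovis" needed a dedicated Kraken model precisely because its reading order is non-linear. The `Region` and `Line` classes, the `order_lines` function, and the pixel coordinates are all hypothetical and do not come from the dataset.

```python
from dataclasses import dataclass

# Column labels named in the dataset description, in the order we assume
# they should be read (an assumption, not something the dataset states).
COLUMN_ORDER = ["Main_col_1", "Main_col_2", "Main_col_3", "Main_col_4"]


@dataclass
class Region:
    """Hypothetical bounding box for one detected layout zone."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class Line:
    """Hypothetical text line, located by the midpoint of its baseline."""
    text: str
    x: float
    y: float


def order_lines(regions: list[Region], lines: list[Line]) -> list[Line]:
    """Return lines column by column, each column read top to bottom.

    Lines that fall outside every column region (for example in a
    MarginTextZone) are left out of the result.
    """
    columns = sorted(
        (r for r in regions if r.label in COLUMN_ORDER),
        key=lambda r: COLUMN_ORDER.index(r.label),
    )
    ordered: list[Line] = []
    for region in columns:
        inside = [ln for ln in lines if region.contains(ln.x, ln.y)]
        ordered.extend(sorted(inside, key=lambda ln: ln.y))
    return ordered


# Made-up coordinates for a double-column page.
regions = [
    Region("Main_col_2", 500, 0, 900, 1000),
    Region("Main_col_1", 0, 0, 400, 1000),
]
lines = [
    Line("c", 700, 100),
    Line("b", 200, 300),
    Line("a", 200, 100),
    Line("d", 700, 300),
]
print([ln.text for ln in order_lines(regions, lines)])  # ['a', 'b', 'c', 'd']
```

A geometric sort like this only works for pages with a strictly linear column layout. Where text is interleaved or continues across zones, a learned reading-order model of the kind described above is the more appropriate tool.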

Provider:

NAKALA - https://nakala.fr (Huma-Num - CNRS)

Created:

2025-03-29