
TranscriboQuest 2024 Medieval Literary

Source: NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/13757439
Description:
# Team Medieval Literary: Documentation

## Context of the dataset

This dataset was created during TranscriboQuest 2024 (Medieval Literary Team), held in Lyon (11/09/2024-13/09/2024). The goal of this summer school was to become better acquainted with eScriptorium and to produce a small but high-quality dataset of medieval manuscripts. The aim of this dataset was to contribute examples of aspects that are underrepresented among the manuscripts used in the CATMuS project. We chose to focus on damaged medieval scientific documents in several different languages. The result is 808 lines transcribed by experts in the field.

The dataset contains the images of the manuscripts and ALTO-XMLs. A general overview of the chosen manuscripts is included in the readme.

## Damage

One of the main features of the dataset is the damage to the manuscripts, which was expected to affect automatic segmentation and transcription. A short description of each type of damage and its actual impact on the manuscript is given in the table in the readme.

As expected, damage caused some problems, mostly with segmentation and transcription. Both processes and the challenges encountered are described in more detail below.

## Segmentation

For identifying the zones of the manuscripts and segmenting the lines, blla.mlmodel was used. Afterwards, all zones were manually corrected. The largest challenge for the model proved to be the varied layouts of the medieval manuscripts. The model often identified multiple columns as one text zone, which had to be manually corrected. Several factors complicated this task: two columns located close together, resulting in incorrect numbering of lines (Toulouse, Bibliothèque d’Étude et du Patrimoine, 872); three columns written in an irregular way, resulting in their incorrect segmentation (BN rps 12533 II); and the damaged state of certain folia (BNF Français 12444).
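
To make the line-numbering problem concrete, here is a toy Python illustration. It is not the ordering logic of the segmentation model, and the coordinates, labels and the `split_x` boundary are invented for the example. It only shows why lines from two columns interleave when they are ordered top to bottom inside a single merged zone, and how splitting the zone by column restores the expected order.

```python
from typing import NamedTuple


class Line(NamedTuple):
    text: str
    x: float  # left edge of the baseline (hypothetical pixel coordinates)
    y: float  # vertical position of the baseline


# Two columns (A and B) of three lines each, laid out side by side.
lines = [
    Line("A1", 100, 100), Line("A2", 100, 150), Line("A3", 100, 200),
    Line("B1", 600, 105), Line("B2", 600, 155), Line("B3", 600, 205),
]


def order_single_zone(zone_lines):
    """Order lines top to bottom, as if they all belonged to one zone."""
    return [line.text for line in sorted(zone_lines, key=lambda l: (l.y, l.x))]


def order_by_column(zone_lines, split_x):
    """Order lines column by column, given a manually chosen column boundary."""
    left = [line for line in zone_lines if line.x < split_x]
    right = [line for line in zone_lines if line.x >= split_x]
    return order_single_zone(left) + order_single_zone(right)


merged = order_single_zone(lines)               # columns interleave
corrected = order_by_column(lines, split_x=350)  # column A first, then column B
print(merged)
print(corrected)
```

In the merged case the lines come out as A1, B1, A2, B2, and so on, which mirrors the incorrect numbering described above; drawing two separate zones corresponds to the second function.
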
Irregular zones also led to incorrect line ordering, which needed manual correction (Paris, BnF, latin 9380; Toulouse, Bibliothèque d’Étude et du Patrimoine, 872). Lastly, KBR IV 398 / 5 posed a specific problem: the scribe of this document left an open space after a line of verse, which the model identified as a separate line.

The SegmOnto Guidelines (https://segmonto.github.io/) were followed to name the various zones of the manuscripts. Initials spanning more than one line were included in the text line, as the next version of the CATMuS model will be able to read them. The applied zones are: MainZone, StampZone, NumberingZone, MarginTextZone, DropCapitalZone, DecorationZone, DamageZone.

## Transcription

The CATMuS guidelines (https://catmus-guidelines.github.io/) were followed when transcribing the documents. The aim was to mimic the original documents as closely as possible, thus producing graphemic transcriptions. Allographic variants such as 'u'/'v' and 'i'/'j' were normalized. Abbreviations were not expanded, and medieval characters were consistently transcribed in accordance with MUFI. To ensure this, the precomposed keyboards provided by CATMuS were used. Spaces were added where they would be expected semantically. For a more detailed description, we refer to the CATMuS guidelines.

For the initial transcription, the CATMuS Medieval 1.5.0 model (Pinche et al., DOI: 10.5281/zenodo.1274323) was applied. It has an accuracy rate of 95.1% (consistent with the average confidence score). During manual correction, the actual accuracy appeared to be lower, owing to the damaged state of the manuscripts. The model had difficulty reading damaged parts (KBR IV 398 / 5; BNF Français 12444), reading faded ink (Toulouse 872), and identifying spaces (KBR IV 398 / 5; Paris, BnF, latin 9380). All errors were corrected manually, and the provided transcriptions can be considered Ground Truth.
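
One common way to quantify how far a model's output is from the corrected Ground Truth is character-level accuracy, defined as one minus the character error rate. The Python sketch below shows that calculation. It is not necessarily how the 95.1% figure was computed, and the two sample strings are invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(prediction: str, ground_truth: str) -> float:
    """Return 1 - CER, where CER = edit distance / length of the ground truth."""
    if not ground_truth:
        return 1.0 if not prediction else 0.0
    return 1.0 - levenshtein(prediction, ground_truth) / len(ground_truth)


# Invented example: the model confuses 'u' and 'm' at the end of the second word.
predicted = "uinum bonmn"
corrected = "uinum bonum"
distance = levenshtein(predicted, corrected)
accuracy = character_accuracy(predicted, corrected)
print(distance, round(accuracy, 3))
```

Because the transcriptions use MUFI characters and combining marks, the result depends on comparing both strings in the same Unicode normalization form.
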
Common errors were confusion between 'u', 'm' and 'i' (BNF Français 12444; BnF latin 930; KBR IV 398 / 5; BN rps 12533 II) as well as between 'e', 'o' and 'c' (BnF latin 930; KBR IV 398 / 5). A stroke was also often added to 'p', due to the script of certain scribes (BNF Français 12444; KBR IV 398 / 5). A stroke was added to an 'l' at the end of a word, where a vertical tilde was required (LJS 216). In BN rps 12533 II, the main issue turned out to be that the CATMuS model lacks the special apothecary signs used by the author (for units such as dram ʒ (M+F2E6) and scruple ℈ (U+2108), or numbers such as Roman half ɟ (U+025F), etc.). They were found and copied manually from the Inventaire des typèmes latins et français existant dans Unicode/MUFI ou à y faire entrer (http://jacques-andre.fr/PICA/SIGMA-PICA.pdf).

When choosing abbreviation signs, some cases proved problematic: a tilde covering two letters can only be placed on one letter, or a separate tilde can be put on each letter, which is not efficient. It is odd that a long s with a stroke can be inserted, given that non-stroked s is not transcribed with a long s. In the available keyboards, the distinction between superscript letters and combining letters is not entirely clear when choosing between the two; the consequences of this choice should be explained somewhere.

## Creators of the dataset

Jessie Dummer, Emmanuelle Kuhry, Sylvain Besson, Zdzislaw Koczarski, Caroline Chevalier-Royet and Caroline Vandyck, under the supervision of Matthias Gille Levenson
Created:
2024-09-13