mschonhardt/georges-1913-normalization
Source: Hugging Face. Updated: 2024-12-03. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/mschonhardt/georges-1913-normalization
Description:
This dataset was created as part of the *Burchards Dekret Digital* project, funded by the Academy of Sciences and Literature | Mainz. It is based on 55,000 lemmata from Karl Georges' *Ausführliches lateinisch-deutsches Handwörterbuch* (Georges 1913) and is intended for training models that normalize medieval Latin texts. It contains approximately 5 million pairs of orthographic variants and their normalized forms, generated by applying systematic orthographic transformations such as `v ↔ u` substitution, `ii → ij` / `ii → ji` replacement, and `ae → ę` replacement. The data is stored as tab-separated word pairs, one pair per line, each consisting of an orthographic variant and its normalized form. The dataset is suitable for training normalization models, developing text restoration tools, and normalizing texts to the spellings of Georges 1913.
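The transformations and the tab-separated layout described above can be illustrated with a minimal Python sketch. It is not the project's actual generation script: the rule table `RULES`, the helper names `variants`, `to_tsv_lines`, and `parse_line`, the choice to apply each rule to all occurrences at once, and the column order (variant first, normalized form second) are all assumptions made for illustration.

```python
# Hypothetical rule table modelled on the transformations named in the
# description; the dataset's real rule set may be larger or differ in detail.
RULES = [("v", "u"), ("u", "v"), ("ii", "ij"), ("ii", "ji"), ("ae", "ę")]


def variants(lemma: str) -> set[str]:
    """Return spellings reachable from `lemma` by applying RULES repeatedly.

    Assumption: each rule rewrites every occurrence in the word at once.
    """
    seen = {lemma}
    frontier = [lemma]
    while frontier:
        word = frontier.pop()
        for old, new in RULES:
            candidate = word.replace(old, new)
            if candidate not in seen:
                seen.add(candidate)
                frontier.append(candidate)
    seen.discard(lemma)  # the normalized form itself is not a variant
    return seen


def to_tsv_lines(lemma: str) -> list[str]:
    """Format pairs as 'variant<TAB>normalized' (column order is assumed)."""
    return [f"{variant}\t{lemma}" for variant in sorted(variants(lemma))]


def parse_line(line: str) -> tuple[str, str]:
    """Split one tab-separated line into (variant, normalized form)."""
    variant, normalized = line.rstrip("\n").split("\t")
    return variant, normalized


if __name__ == "__main__":
    for line in to_tsv_lines("filii"):
        print(line)
```

Under these assumptions, the lemma `filii` yields the two variants `filij` and `filji`, each paired with `filii` on its own line. A file in the real dataset could be read line by line with `parse_line`, provided the column order matches the assumption above.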
Provider:
mschonhardt



