
mschonhardt/georges-1913-normalization

Hugging Face | Updated 2024-12-03 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/mschonhardt/georges-1913-normalization
Description:
This dataset was created as part of the *Burchards Dekret Digital* project, funded by the Academy of Sciences and Literature | Mainz. It is based on 55,000 lemmata from Karl Georges' *Ausführliches lateinisch-deutsches Handwörterbuch* (Georges 1913) and is intended for training models that normalize medieval Latin texts. It contains approximately 5 million pairs of orthographic variants and their normalized forms, generated by applying systematic orthographic transformations such as `v ↔ u` substitution, `ii → ij` / `ii → ji` replacement, and `ae → ę` replacement. The data is stored as tab-separated word pairs: each line holds one orthographic variant and its normalized form. The dataset is suitable for training normalization models, developing text restoration tools, and applying normalization based on Georges 1913.
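To make the transformations concrete, here is a minimal Python sketch of how variant/normalized pairs of this kind could be produced. It is not the project's actual generation script, which this page does not show. The `RULES` table, the function names `variants` and `to_tsv_lines`, the variant-first column order, and the choice to include the unchanged spelling as a variant are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical rule table (an assumption, not the project's real rule set):
# each normalized substring maps to the spellings it may take in a variant.
RULES = {
    "v": ("v", "u"),
    "u": ("u", "v"),
    "ii": ("ii", "ij", "ji"),
    "ae": ("ae", "ę"),
}


def variants(normalized):
    """Return every spelling obtained by applying the rules independently
    at each matching position of the normalized form."""
    options = []
    i = 0
    while i < len(normalized):
        two = normalized[i:i + 2]
        if two in RULES:
            # Two-character rules take precedence over single-character ones.
            options.append(RULES[two])
            i += 2
        elif normalized[i] in RULES:
            options.append(RULES[normalized[i]])
            i += 1
        else:
            # Characters without a rule are copied unchanged.
            options.append((normalized[i],))
            i += 1
    return {"".join(combo) for combo in product(*options)}


def to_tsv_lines(normalized):
    """Format the pairs as tab-separated lines, variant first (assumed order)."""
    return [f"{v}\t{normalized}" for v in sorted(variants(normalized))]


if __name__ == "__main__":
    for line in to_tsv_lines("filii"):
        print(line)
```

Because the rules combine at every matching position, the number of variants grows multiplicatively with word length, which is consistent with roughly 55,000 lemmata expanding to several million pairs.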
Provider:
mschonhardt