mschonhardt/georges-1913-normalization

Name: mschonhardt/georges-1913-normalization
Creator: mschonhardt
Published: 2024-12-03 08:44:18
License: no description available

Hugging Face, updated 2024-12-03, indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/mschonhardt/georges-1913-normalization

Description:

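This dataset was created as part of the *Burchards Dekret Digital* project. It is based on 55,000 lemmata from Karl Georges' *Ausführliches lateinisch-deutsches Handwörterbuch* (Georges 1913) and is intended for training models that normalize medieval Latin text. It contains approximately 5 million word pairs, each consisting of an orthographic variant and its normalized form, generated by applying systematic orthographic transformations such as `v ↔ u` substitution, `ii → ij` / `ii → ji` replacement, and `ae → ę` replacement. The data is formatted as tab-separated word pairs, with one orthographic variant and its normalized form per line.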
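A minimal sketch of reading the tab-separated pairs, assuming the column order is variant first and normalized form second, as the description suggests. The function name `read_pairs` and the two sample lines are invented for illustration and are not taken from the dataset.

```python
import csv
import io


def read_pairs(handle):
    """Yield (variant, normalized) tuples from a tab-separated text stream."""
    # QUOTE_NONE keeps quote characters as literal text instead of field delimiters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        variant, normalized = row
        yield variant, normalized


# Hypothetical sample lines standing in for a real data file.
sample = io.StringIO("uia\tvia\nfilij\tfilii\n")
pairs = list(read_pairs(sample))
```

To read an actual file, an open file handle (for example from `open(path, encoding="utf-8", newline="")`) can be passed in place of the `StringIO` object.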

This dataset was created as part of the Burchards Dekret Digital project, funded by the Academy of Sciences and Literature | Mainz. It is based on 55,000 lemmata from Karl Georges' Latin-German dictionary (1913 edition) and is intended for training models that normalize medieval Latin texts. The dataset contains approximately 5 million pairs of orthographic variants and their normalized forms, generated by introducing systematic orthographic transformations such as v and u substitutions, ii to ij/ji replacements, etc. It is suitable for training normalization models, developing text restoration tools, and applying normalization based on Georges 1913.
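The kind of systematic transformation described above can be sketched as follows. This is an illustration only, not the project's actual generation procedure: the rule list `RULES` and the function `generate_variants` are hypothetical, each rule is applied to every occurrence in the word, and the `u → v` and `ii → ji` directions are left out for brevity.

```python
from itertools import combinations

# Hypothetical subset of the substitutions named in the description.
RULES = [("v", "u"), ("ii", "ij"), ("ae", "ę")]


def generate_variants(normalized):
    """Return the set of spellings produced by applying any non-empty subset of RULES."""
    found = set()
    for size in range(1, len(RULES) + 1):
        for subset in combinations(RULES, size):
            word = normalized
            for old, new in subset:
                word = word.replace(old, new)  # replaces every occurrence
            if word != normalized:
                found.add(word)  # keep only spellings that actually changed
    return found


# Emit tab-separated lines: variant first, normalized form second (assumed order).
lines = [f"{variant}\t{lemma}" for lemma in ["viae", "filii"]
         for variant in sorted(generate_variants(lemma))]
```

With these three rules, `viae` yields the variants `uiae`, `vię`, and `uię`, and `filii` yields `filij`.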

Provider:

mschonhardt
