nevmenandr/russian-old-orthography-ocr

Name: nevmenandr/russian-old-orthography-ocr
Creator: nevmenandr
Published: 2024-10-18 05:31:46
License: not specified

Source: Hugging Face. Updated 2024-10-18; indexed 2024-12-14.

Download link:

https://hf-mirror.com/datasets/nevmenandr/russian-old-orthography-ocr

Description:

The dataset contains source images and human-readable extracted texts. All texts were published in Russia in the 19th century in pre-reform orthography. The dataset is intended for training and evaluating optical character recognition (OCR) systems on Russian texts published before the orthographic reform of 1917. The data structure section details the naming conventions and storage locations of the image and text files. The old orthography differs from the modern one in that it uses four letters that were removed from the Russian alphabet after the reform, as well as a set of spelling rules that were later abolished. These letters and rules are handled by a dedicated Python package that converts text from the old orthography to the new one.
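To illustrate the kind of difference described above, here is a minimal sketch of a letter-level conversion. It is not the Python package mentioned in the description and does not reproduce its API. The function name `modernize` and the mapping are assumptions made for illustration. The sketch covers only the four removed letters (ѣ, і, ѳ, ѵ) and the word-final hard sign, and ignores the other abolished spelling rules.

```python
import re

# Hypothetical, simplified mapping (an assumption, not the real package):
# each removed letter is replaced by its usual modern counterpart.
LETTER_MAP = str.maketrans({
    "\u0463": "е", "\u0462": "Е",  # ѣ / Ѣ (yat)
    "\u0456": "и", "\u0406": "И",  # і / І (decimal i)
    "\u0473": "ф", "\u0472": "Ф",  # ѳ / Ѳ (fita)
    "\u0475": "и", "\u0474": "И",  # ѵ / Ѵ (izhitsa)
})

# A hard sign at the very end of a word was dropped by the reform;
# a hard sign inside a word (as in "объявление") is kept.
FINAL_HARD_SIGN = re.compile(r"(?<=\w)[ъЪ]\b")


def modernize(text: str) -> str:
    """Apply the simplified letter-level conversion to a string."""
    text = text.translate(LETTER_MAP)
    return FINAL_HARD_SIGN.sub("", text)


if __name__ == "__main__":
    print(modernize("б\u0463лый миръ"))
```

A mapping like this can help when comparing OCR output in the old orthography against modern reference text, but real conversion needs word-level rules (for example, adjective endings), which is why a dedicated package exists.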

Provider:

nevmenandr
