AnnikaSimonsen/FO-OCRtrain

Name: AnnikaSimonsen/FO-OCRtrain
Creator: AnnikaSimonsen
Published: 2023-05-01 14:47:05
License: not specified

Source: Hugging Face. Updated: 2023-05-01. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/AnnikaSimonsen/FO-OCRtrain

Description:

This is artificial Faroese OCR training data, created by collecting real OCR errors and inserting them into 38 million tokens of non-OCRed text. The parallel data is set up as a TSV file: the first column (fo_err) contains the text with OCR errors, and the second column (fo_corr) contains the same text without OCR errors. The dataset was created using scripts from https://github.com/atlijas/ocr-post-processing. Two ByT5 models have been fine-tuned on the data: https://huggingface.co/svanhvit/byt5-ocr-post-processing-faroese-ai-yfirlestur and https://huggingface.co/svanhvit/byt5-ocr-post-processing-faroese. This is a work in progress, and more versions of the dataset and the models are on the way.
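
As a rough illustration of the two-column layout described above, the following minimal sketch reads (fo_err, fo_corr) pairs from a TSV stream using only the Python standard library. The function name `read_pairs`, the `has_header` flag, and the sample row are hypothetical; whether the actual file starts with a header row is an assumption, not something the description states.

```python
import csv
import io


def read_pairs(handle, has_header=True):
    """Yield (fo_err, fo_corr) tuples from a two-column TSV stream.

    has_header is an assumption: set it to False if the file has no header row.
    """
    # QUOTE_NONE keeps quote characters in the text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        if has_header and i == 0:
            continue  # skip the assumed header line
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]


# Tiny in-memory sample, invented for illustration (not taken from the dataset).
sample = io.StringIO("fo_err\tfo_corr\nF0royar\tFøroyar\n")
pairs = list(read_pairs(sample))
```

For a file on disk, the same function could be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the TSV file was downloaded.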

Provider:

AnnikaSimonsen

Summary of original information