AnnikaSimonsen/FO-OCRtrain
Source: Hugging Face. Updated 2023-05-01. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/AnnikaSimonsen/FO-OCRtrain
Description:
This is artificial Faroese OCR training data, created by collecting real OCR errors and inserting them into 38 million tokens of non-OCRed text.
The parallel data is set up as a TSV file with the first column (fo_err) being the text with OCR errors, while the second column (fo_corr) is without OCR errors.
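The two-column layout can be read with the Python standard library. The following is a minimal sketch, not an official loader: the function name `read_pairs` and the sample sentence are made up for illustration, and whether the real file starts with a `fo_err`/`fo_corr` header row is an assumption (the code skips such a row if it finds one).

```python
import csv
import io

def read_pairs(handle):
    """Yield (fo_err, fo_corr) tuples from a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        if row[0] == "fo_err" and row[1] == "fo_corr":
            continue  # skip a header row, if the file has one (assumption)
        yield row[0], row[1]

# Made-up sample standing in for the real file contents.
sample = "fo_err\tfo_corr\nTórshavn er høvuðsstaðnr\tTórshavn er høvuðsstaður\n"
pairs = list(read_pairs(io.StringIO(sample)))
```

For the real file, pass an opened file instead of the `io.StringIO` sample, for example `open(path, encoding="utf-8", newline="")`, where `path` points to the downloaded TSV.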
This dataset was created by using scripts from https://github.com/atlijas/ocr-post-processing.
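The general idea of inserting OCR-style errors into clean text can be illustrated with a toy sketch. This is not the code from the repository above: the `CONFUSIONS` table, the `inject_errors` function, and the error rate are all hypothetical, whereas the real pipeline draws on errors collected from actual OCR output.

```python
import random

# Hypothetical confusion table: correct substring -> possible OCR misreadings.
CONFUSIONS = {"ð": ["d", "o"], "rn": ["m"], "í": ["i", "l"], "ø": ["o"]}

def inject_errors(text, rate, rng):
    """Replace confusable substrings in `text` with probability `rate`."""
    keys = sorted(CONFUSIONS, key=len, reverse=True)  # try longer keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i) and rng.random() < rate:
                out.append(rng.choice(CONFUSIONS[key]))
                i += len(key)
                break
        else:
            # No substitution made at this position: copy the character.
            out.append(text[i])
            i += 1
    return "".join(out)

rng = random.Random(0)  # fixed seed so runs are repeatable
clean = "Tórshavn er høvuðsstaður"
tsv_line = inject_errors(clean, 0.5, rng) + "\t" + clean  # fo_err, then fo_corr
```

Each generated line pairs a corrupted string with its clean source, matching the `fo_err` and `fo_corr` column order described above.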
Two ByT5 models have been fine-tuned with the data: https://huggingface.co/svanhvit/byt5-ocr-post-processing-faroese-ai-yfirlestur and https://huggingface.co/svanhvit/byt5-ocr-post-processing-faroese
This is a work in progress, and more versions of the dataset and the models are on the way.
Provider:
AnnikaSimonsen
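Summary of original information
Dataset overview
Dataset description
- This is synthetic Faroese OCR training data, created by collecting real OCR errors and inserting them into 38 million tokens of text that had not been OCRed.
Data structure
- The data is organized as a TSV file with two main columns:
fo_err: text with OCR errors.
fo_corr: text without OCR errors.
Dataset creation
- The dataset was created using scripts from https://github.com/atlijas/ocr-post-processing.
Model applications
- The dataset has been used to fine-tune two ByT5 models:
https://huggingface.co/svanhvit/byt5-ocr-post-processing-faroese-ai-yfirlestur
https://huggingface.co/svanhvit/byt5-ocr-post-processing-faroese
Dataset status
- The dataset is a work in progress; more versions of the dataset and the models are planned.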



