jeanflop/post_ocr_correction2

Name: jeanflop/post_ocr_correction2
Creator: jeanflop
Published: 2024-10-28 11:08:27
License: Not specified

Hugging Face | Updated: 2024-10-28 | Indexed: 2024-12-14

Download link:

https://hf-mirror.com/datasets/jeanflop/post_ocr_correction2


Description:

This is a synthetic dataset for post-OCR correction. It contains over 2,000,000 rows of French text pairs and follows the Croissant format. It is intended for training small language models to correct text. OCR-malformed text is simulated by randomly applying transformations: removing vowels, replacing multiple spaces with a single space, removing single letters, removing punctuation, randomly dropping characters, and scrambling words. Each word has a 60% chance of being selected for alteration, and between 30% and 60% of the words in each text can be altered. The goal is for models trained on these pairs to learn to select the correct word from the surrounding context.
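
The transformations listed in the description can be illustrated with a minimal Python sketch. This is not the dataset author's generation script, which the listing does not include. All function names are hypothetical, and several details are assumptions: "removing single letters" is read as dropping one letter from a word, "scrambling words" as shuffling the letters inside a word, the drop probability of 0.3 is arbitrary, and only ASCII punctuation is stripped. The sketch applies the 60% per-word selection chance but does not enforce the 30% to 60% bound, because the description does not say how the two rules combine.

```python
import random
import re
import string

# French vowels, including accented forms (an assumption about what counts as a vowel).
VOWELS = set("aeiouyàâäéèêëîïôöùûüÿ")


def remove_vowels(word, rng=None):
    """Drop every vowel from the word."""
    return "".join(c for c in word if c.lower() not in VOWELS)


def remove_punctuation(word, rng=None):
    """Strip ASCII punctuation characters only."""
    return word.translate(str.maketrans("", "", string.punctuation))


def remove_single_letter(word, rng):
    """Delete one character at a random position."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def drop_random_chars(word, rng, p=0.3):
    """Delete each character independently with probability p (p is an assumed value)."""
    return "".join(c for c in word if rng.random() >= p)


def scramble(word, rng):
    """Shuffle the characters of the word."""
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)


WORD_OPS = [remove_vowels, remove_punctuation, remove_single_letter,
            drop_random_chars, scramble]


def corrupt(text, seed=None, p_select=0.6):
    """Return a noisy copy of text; each word is altered with probability p_select."""
    rng = random.Random(seed)
    # Collapse runs of whitespace into single spaces before splitting into words.
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    noisy = []
    for word in words:
        if rng.random() < p_select:
            word = rng.choice(WORD_OPS)(word, rng)
        noisy.append(word)
    # Words that were reduced to nothing are dropped (a choice made for this sketch).
    return " ".join(w for w in noisy if w)


def make_pair(text, seed=None):
    """Build one (noisy input, clean target) training pair."""
    return corrupt(text, seed=seed), text
```

Calling `make_pair("Le chat dort sur le canapé.", seed=0)` returns a corrupted sentence together with the untouched original, which is the shape of text pair the description refers to. Passing a fixed seed makes the corruption reproducible.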

Provided by:

jeanflop
