jeanflop/post-ocr-correction
Source: Hugging Face. Updated 2024-11-02. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/jeanflop/post-ocr-correction
Description:
This is a synthetic dataset generated for post-OCR correction tasks. It contains over 2,000,000 rows of French text pairs and follows the Croissant format. It is intended for training small language models to correct text: random transformations are applied to clean text to simulate OCR-malformed input. The transformations include removing vowels, replacing multiple spaces with a single space, removing single letters, removing punctuation, randomly dropping characters, and randomly scrambling words. In addition, punctuation is modified, words are added, and repetitions are created. Each word in a text has a 50% chance of being selected for alteration, and the number of alterations applied is random, ranging from 10% to 50% of the words in each text.
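The corruption process described above can be sketched roughly as follows. This is an illustrative approximation, not the dataset author's actual generation script: the function names, the `input`/`target` field names, the vowel set, and the choice to draw the altered fraction uniformly between 10% and 50% are all assumptions, and only four of the listed transformations are shown.

```python
import random
import string

# Assumed vowel set, including common French accented vowels.
VOWELS = set("aeiouyàâäéèêëîïôöùûüÿAEIOUY")

def remove_vowels(word, rng):
    # Drop every vowel from the word.
    return "".join(c for c in word if c not in VOWELS)

def drop_random_char(word, rng):
    # Delete one character at a random position (words of length 1 are kept).
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def scramble(word, rng):
    # Shuffle the characters of the word.
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)

def strip_punctuation(word, rng):
    # Remove ASCII punctuation attached to the word.
    return word.translate(str.maketrans("", "", string.punctuation))

TRANSFORMS = [remove_vowels, drop_random_char, scramble, strip_punctuation]

def corrupt(text, rng):
    # Alter a random 10% to 50% of the words (assumed sampling scheme).
    words = text.split()
    if not words:
        return text
    fraction = rng.uniform(0.10, 0.50)
    k = max(1, round(fraction * len(words)))
    for i in rng.sample(range(len(words)), k):
        words[i] = rng.choice(TRANSFORMS)(words[i], rng)
    # Words that became empty after a transformation are dropped.
    return " ".join(w for w in words if w)

rng = random.Random(0)
clean = "Le chat dort sur le canapé, près de la fenêtre."
# Hypothetical field names for one (corrupted, clean) training pair.
pair = {"input": corrupt(clean, rng), "target": clean}
print(pair)
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which helps when regenerating or auditing pairs.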
Provided by:
jeanflop



