jeanflop/post_ocr_correction-512
Source: Hugging Face. Updated 2024-11-02. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/jeanflop/post_ocr_correction-512
Description:
This is a synthetic dataset generated for post-OCR correction. It contains over 1,000,000 rows of French text pairs and follows the Croissant format. It is designed for training small language models to correct text. To make the data resemble OCR-damaged text, several random transformations were applied: removing vowels, replacing multiple spaces with a single space, removing single letters, removing punctuation, randomly dropping characters, and randomly scrambling words. In addition, punctuation was modified, words were added, and repetitions were created. Each word has a 60% chance of being selected for alteration, and a random number of alterations is applied, so that 30% to 60% of the words in each text may be altered.
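The description does not include the generation code, so the following is only a minimal sketch of how word-level noise of this kind might be produced. The function names (`corrupt`, `remove_vowels`, and so on), the vowel set, the choice of one transformation per selected word, and the reading of "scrambling" as shuffling the letters inside a word are all assumptions made for illustration, not the dataset author's implementation. Only the 60% selection probability comes from the description.

```python
import random
import re

# Assumed vowel set covering common French accented vowels (illustrative only).
VOWELS = set("aeiouyàâäéèêëîïôöùûüAEIOUY")

def remove_vowels(word, rng):
    # Drop every vowel from the word.
    return "".join(c for c in word if c not in VOWELS)

def drop_random_char(word, rng):
    # Delete one character at a random position; leave 1-char words alone.
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def strip_punctuation(word, rng):
    # Remove anything that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", word)

def scramble(word, rng):
    # Shuffle the letters inside the word (one reading of "scrambling").
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)

def repeat(word, rng):
    # Create a repetition by emitting the word twice.
    return word + " " + word

TRANSFORMS = [remove_vowels, drop_random_char, strip_punctuation, scramble, repeat]

def corrupt(text, p_select=0.6, seed=None):
    """Return a noisy copy of `text`; each word is altered with probability p_select."""
    rng = random.Random(seed)
    out = []
    # str.split() with no argument also collapses runs of whitespace.
    for word in text.split():
        if rng.random() < p_select:
            word = rng.choice(TRANSFORMS)(word, rng)
        if word:  # a word made only of vowels can become empty
            out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    clean = "Le  petit chat dort,   paisiblement sur le canapé."
    # A (noisy, clean) pair like those the dataset is described as containing.
    print((corrupt(clean, seed=0), clean))
```

Passing a fixed `seed` makes the corruption reproducible, which is convenient when regenerating the same training pairs. Setting `p_select=0.0` returns the text with only its whitespace normalized.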
Provider:
jeanflop



