
Post-OCR-Correction

Source: ModelScope community. Updated 2025-12-05. Indexed 2025-06-21.
Download link:
https://modelscope.cn/datasets/PleIAs/Post-OCR-Correction
Resource description:
**Post-OCR correction** is a large corpus of 1 billion words containing original texts with a varying number of OCR mistakes, together with an experimental multilingual post-OCR correction output created by Pleias.

Generation of Post-OCR correction was performed using HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.

## Description

All the texts come from collections integrated into *Common Corpus*, the largest open corpus for pretraining, previously released by Pleias on HuggingFace.

The corpus comprises cultural heritage texts in French, English, German and Italian with the following distribution:

* French: newspaper texts from Gallica, 438,034,960 words.
* English: newspaper texts from Chronicling America, 300,522,681 words.
* Italian: monograph texts from various sources, notably Internet Archive, 144,441,539 words.
* German: monograph texts from various sources, notably Internet Archive, 97,396,147 words.

OCR quality was a major limitation on the potential reuse of Common Corpus for training AI models and for cultural analytics research. Promising results from post-OCR correction show that the resource can be significantly enhanced in this respect.

## Example

Original excerpt with many OCR errors from the Omaha Bee (June 25, 1890):

> "THE OMAHA ! DAILY BEE.
>
> TWENTIETH YEAR. OMAHA. WEDNESDAY JMjgNING. ( ! JUNE 25. 1890. NUMBER 7.
>
> LICKED UP BY THE FLAMES , An Incendiary Wreaks His Vengeance o Blue Hill , Nebraska. NEARLY TWENTY STORES BLOTTED OUT , Tlio Amount of lnmnc Done Iloimlily Kutlmnted .nt Over Fifty Thousand DollurH , With Comparatively Little Insurance.
>
> BLUB HIM , Neb. , Juno 24. ( Special Tele-pram to TUB BBK. ) At 2M : this morning a.flro broke out simultaneously In two places on the north sldo of Main street in Blue Hill. The ono at the opera house , nt almost the ex treme cast end of the street , was extinguished by the efforts of O. C. 1C. Lolgman , Mrs. B. II. Munson and the girl help at the Muuson Louse. I"

Correction by Pleias:

> "THE OMAHA DAILY BEE.
>
> TWENTIETH YEAR. OMAHA, WEDNESDAY MORNING. JUNE 25, 1890. NUMBER 7.
>
> LICKED UP BY THE FLAMES,
>
> An Incendiary Wreaks His Vengeance on Blue Hill, Nebraska.
>
> NEARLY TWENTY STORES BLOTTED OUT,
>
> The Amount of Damage Done Is Estimated at Over Fifty Thousand Dollars, With Comparatively Little Insurance.
>
> BLUE HILL, Neb., June 24. (Special Telegraph to THE BEE.) At 2:30 this morning a fire broke out simultaneously in two places on the north side of Main street in Blue Hill. The one at the opera house, at almost the extreme east end of the street, was extinguished by the efforts of O. C. J. Longman, Mrs. B. H. Munson and the girl help at the Munson House."

## Potential use

As part of Pleias's commitment to open science, this release aims to collectively assess the quality of the post-OCR correction process, prior to the release of our LLM-based post-OCR correction models.

While the quality of the corrected text is higher than that of any other approach tested to date, LLM-based correction is probability-based. The estimated correction can introduce words or corrections not present in the original text, especially if the OCR is of poor quality, or omit some part of the original text.

Potential downstream uses of post-OCR correction include:

* Assisting manual correction that would require a higher level of accuracy (for instance on Wikisource).
* Classification tasks, where a higher rate of recognized words helps predict the genre or topic of a text.
* Deduplication tasks, where a higher rate of recognized words helps assess whether two texts are identical.
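
Because the correction can introduce or omit text, it may be useful to screen each record before downstream use. The following is a minimal Python sketch of such a screen, assuming each record provides the OCR text and the corrected text as plain strings. The function name `divergence_report` and both thresholds are hypothetical choices for illustration; they are not part of the dataset or of any Pleias tooling.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout noise is not counted as a difference."""
    return " ".join(text.lower().split())


def divergence_report(
    ocr_text: str,
    corrected_text: str,
    min_similarity: float = 0.5,    # hypothetical threshold, tune on real data
    min_length_ratio: float = 0.7,  # hypothetical threshold, tune on real data
) -> dict:
    """Compare an OCR text with its correction and flag suspicious pairs.

    similarity:   character-level ratio in [0, 1] from difflib.
    length_ratio: corrected length divided by OCR length; a low value
                  suggests that part of the source text was dropped.
    """
    a = normalize(ocr_text)
    b = normalize(corrected_text)
    similarity = SequenceMatcher(None, a, b, autojunk=False).ratio()
    length_ratio = len(b) / len(a) if a else 0.0
    return {
        "similarity": similarity,
        "length_ratio": length_ratio,
        "needs_review": similarity < min_similarity
        or length_ratio < min_length_ratio,
    }


if __name__ == "__main__":
    # Fragments taken from the Omaha Bee example above.
    ocr = "BLUB HIM , Neb. , Juno 24. ( Special Tele-pram to TUB BBK. )"
    fixed = "BLUE HILL, Neb., June 24. (Special Telegraph to THE BEE.)"
    print(divergence_report(ocr, fixed))
    # A correction that drops most of the source is flagged for review.
    print(divergence_report(ocr, "BLUE HILL"))
```

`SequenceMatcher` can take quadratic time in the worst case, so on long documents it may be more practical to compare page by page or paragraph by paragraph. A flagged pair is only a candidate for manual review, not proof of an error.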

Provider:
maas
Created:
2025-06-19
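Collected summary
Dataset introduction
Background and challenges
Background overview
Post-OCR-Correction is a large-scale multilingual post-OCR correction corpus of 1 billion words. It covers cultural heritage texts (such as newspapers and monographs) in French, English, Italian and German, and provides both the original text with OCR errors and a corrected version. The dataset aims to assess and improve the quality of post-OCR correction, and can support downstream tasks such as assisted manual correction, text classification and deduplication. It is built on Common Corpus and was generated using high-performance computing resources.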
The summary above was collected and generated by 遇见数据集.