l3ipp/pleias-post-ocr-aligned-paragraph-en

Source: Hugging Face. Updated 2026-04-22; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/l3ipp/pleias-post-ocr-aligned-paragraph-en
Description:
This dataset is a cleaned and re-aligned paragraph-level version of `PleIAs/Post-OCR-Correction`, prepared for post-OCR correction training. The original dataset contains pairs of noisy OCR text and corrected text. However, some pairs include hallucinations, omissions, or weak alignment between source and target. This derived version restructures the data into paragraph-level aligned examples using an alignment pipeline, with optional pruning to remove poorly matched content. The goal is to provide training examples that are more faithful to the original OCR input and therefore more suitable for supervised post-OCR correction.
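
The description mentions paragraph-level alignment with optional pruning but does not document the pipeline itself. As a rough illustration of the idea only, the sketch below pairs OCR paragraphs with corrected paragraphs by character similarity and drops weak matches. Everything in it is an assumption made for illustration: the function names `similarity` and `align_paragraphs`, the blank-line paragraph split, the greedy matching strategy, and the `min_ratio` threshold of 0.6 are hypothetical and are not taken from the dataset authors' code.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (standard-library difflib)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def align_paragraphs(ocr_text: str, corrected_text: str, min_ratio: float = 0.6):
    """Greedily pair each OCR paragraph with its most similar corrected paragraph.

    Hypothetical sketch, not the dataset's actual pipeline:
    - paragraphs are assumed to be separated by blank lines;
    - each corrected paragraph is used at most once;
    - pairs scoring below `min_ratio` are pruned, which drops corrected
      paragraphs with no OCR counterpart (possible hallucinations) and
      OCR paragraphs with no correction (possible omissions).
    """
    ocr_paras = [p.strip() for p in ocr_text.split("\n\n") if p.strip()]
    cor_paras = [p.strip() for p in corrected_text.split("\n\n") if p.strip()]

    pairs = []
    used = set()
    for src in ocr_paras:
        best_idx, best_ratio = None, 0.0
        for i, tgt in enumerate(cor_paras):
            if i in used:
                continue
            ratio = similarity(src, tgt)
            if ratio > best_ratio:
                best_idx, best_ratio = i, ratio
        # Pruning step: keep only sufficiently similar pairs.
        if best_idx is not None and best_ratio >= min_ratio:
            used.add(best_idx)
            pairs.append((src, cor_paras[best_idx], best_ratio))
    return pairs


if __name__ == "__main__":
    ocr = (
        "Tbe quick hrown fox jumps ovcr the lazy dog.\n\n"
        "Second paragraph wilh sorne OCR noisc in it."
    )
    corrected = (
        "The quick brown fox jumps over the lazy dog.\n\n"
        "An entirely invented sentence that never appeared.\n\n"
        "Second paragraph with some OCR noise in it."
    )
    for src, tgt, ratio in align_paragraphs(ocr, corrected):
        print(f"{ratio:.2f} | {src} -> {tgt}")
```

A greedy character-ratio match is quadratic in the number of paragraphs and is used here only because it needs no external dependencies; a real pipeline would more likely rely on a proper sequence aligner and a tuned threshold.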
Provider:
l3ipp