FrancophonIA/ICDAR_2019_Competition_Post-OCR_Text_Correction

Name: FrancophonIA/ICDAR_2019_Competition_Post-OCR_Text_Correction
Creator: FrancophonIA
Published: 2025-03-30 14:23:27
License: No description available

Source: Hugging Face. Updated 2025-03-30; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/FrancophonIA/ICDAR_2019_Competition_Post-OCR_Text_Correction

Description:

This dataset is the corpus for the ICDAR 2019 competition on post-OCR text correction. It contains about 22 million OCRed characters together with the corresponding Gold Standard (GS) text. The documents come from several digital collections, including those of the National Library of France (BnF) and the British Library (BL). The GS comes from BnF's internal projects and from external initiatives such as Europeana Newspapers, IMPACT, Project Gutenberg, Perseus, and Wikisource. The data is split into training and evaluation sets, and the full dataset was made public after the competition. The Finnish-language data is subject to special requirements and may not be re-shared on other websites. Use of the dataset is subject to copyright restrictions; it is free for non-commercial use.
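Since the corpus pairs OCRed text with a Gold Standard, a natural first step is to measure how far an OCR string is from its GS counterpart. The following is a minimal sketch using plain Levenshtein distance. The function names `levenshtein` and `char_error_rate` are illustrative assumptions, the strings are assumed to be already aligned and loaded by the caller, and this is not claimed to be the competition's official metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def char_error_rate(ocr_text: str, gold_text: str) -> float:
    """Edit distance normalised by the length of the gold standard text."""
    if not gold_text:
        # Assumption: an empty gold text counts as fully wrong unless the OCR is empty too.
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, gold_text) / len(gold_text)
```

For example, `char_error_rate("tbe cat", "the cat")` counts one substitution over seven gold characters. A correction system could be judged by whether it lowers this rate relative to the raw OCR output.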

Provider:

FrancophonIA