Dataset for the paper "Improving the Quality of Document Embeddings with Post-OCR Correction on a Multilingual Historical Corpus"

Indexed by NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/14628711

Description:

Collection of 946 historical documents belonging to the Biblioteca Nacional de Catalunya, spanning from 1914 to 1951. Each document is one issue of a magazine, comprising articles by different authors. Magazines were selected for their relevance to art in general and, more specifically, to early 20th-century avant-garde movements (e.g., Dadaism, Cubism). For each document, both the scan of the original artifact and the raw plain text extracted with the ABBYY FineReader OCR tool are available. Most of the text in the corpus is in Catalan (76% of words overall), with a smaller share of Spanish (20%) and, to a lesser extent, French (3%) and Italian (1%). A peculiarity of this dataset is that it is not only multilingual at the corpus level: most artifacts also mix languages within a single document.

Created:

2025-01-10