
Data of Paper "Turning a Multilingual Historical Archive into an Information System through Post-OCR Correction and Content-Based Indexation"

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10201751
Description:
We evaluated our approach on a collection of 946 historical documents belonging to the Biblioteca Nacional de Catalunya (BNC), spanning 1914 to 1951. Each document is an issue of a magazine, comprising articles by different authors. As a result, despite the thematic focus of each magazine and issue, every document is somewhat heterogeneous. Magazines were selected for their relevance to art in general and, more specifically, to early 20th-century avant-garde movements (e.g., Dadaism and Cubism). For each document, we have a scan of the original artifact and the raw plain text extracted with the ABBYY FineReader OCR tool. To the best of our knowledge, this is the first Catalan-dominated OCR corpus ever released.
Created:
2023-11-23