
OCR 18th Century training and testing dataset

Zenodo · Updated 2026-02-10 · Indexed 2026-05-26
Download link:
https://zenodo.org/doi/10.5281/zenodo.18597235
Description:
OCR 18th century training dataset

To adapt OCR models to the typographic and material characteristics of 18th-century printed texts, we constructed a custom line-level training corpus combining manually annotated, semi-automatically generated, and synthetic data. This design balances transcription fidelity, historical diversity, and scalability while preserving historically meaningful orthographic variation.

The manually annotated subset contains approximately 62,000 color line images cropped from original scans, primarily from McGill University collections. These include running text as well as non-body text elements common in early print, such as page numbers, marginalia, footnotes, catchwords, and signature marks. Ground truth was produced by manually correcting initial OCR outputs, yielding high-fidelity transcriptions paired with images that retain visual artefacts such as uneven inking and paper texture.

The semi-automated subset consists of approximately 68,000 black-and-white line images cropped from Eighteenth Century Collections Online (ECCO) documents spanning over 2,000 books. This subset was constructed by aligning ECCO page images and OCR output with the corresponding page-level transcriptions from the manually transcribed and segmented ECCO-TCP corpus. Page-level alignment was followed by line-level matching using edit-distance-based alignment, whereby each ECCO-TCP line was paired with the most similar OCR line on the same page, yielding a silver-standard set grounded in authentic historical sources. To improve quality, the aligned data were filtered to remove structurally unreliable lines and pages (e.g., very short lines, poor-quality scans, layout elements such as tables or endnotes, and pages with high overall mismatch rates). From the resulting silver standard, multiple training subsets were sampled, including random and stratified samples reflecting varying OCR difficulty and bibliographic characteristics. This procedure increases typographic and editorial diversity while maintaining close alignment with historical ground truth.

To expand coverage and enable controlled augmentation, we generated approximately 65,000 synthetic line images in English (77%) and French (23%) using historically informed typefaces. IM FELL English captures irregular early modern letterforms (e.g., long-s and uneven stroke contrast), while EB Garamond introduces cleaner but still period-appropriate serif typography. Synthetic images further incorporate background textures from real documents to simulate paper noise and bleed-through.

In total, the training corpus comprises approximately 195,000 line images. Manual data provide reliable supervision, semi-automatic data broaden historical coverage, and synthetic data increase exposure to rare glyphs, spellings, and layout phenomena.

OCR 18th century testing dataset

English Historical Dataset

The primary evaluation corpus consists of approximately 200 pages (about 10,000 line images) of English historical texts from the early 18th century onward, including works by authors such as David Hume, John Locke, and Adam Ferguson. Sources include both black-and-white scans from ECCO and higher-quality grayscale or color scans from the Internet Archive. This dataset supports analysis of scan modality effects and robustness to visual variation.

French Historical Dataset

To assess cross-lingual generalization, we additionally evaluate on a smaller French corpus of approximately 40 pages (about 2,000 line images) from 18th-century texts, including works by Voltaire and the "Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers (1751–1772)". These data are sourced from color Internet Archive scans and are used for high-level comparative evaluation rather than detailed error typology analysis.
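The line-level matching used for the semi-automated training subset (pairing each ECCO-TCP line with the most similar OCR line on the same page, then discarding unreliable matches) can be illustrated with a minimal Python sketch. This is not the dataset authors' implementation: the function names, the normalization by the longer line's length, and the `max_distance` and `min_length` thresholds are all assumptions chosen for illustration, since the description does not state the actual values.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (0.0 means identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def match_lines(tcp_lines, ocr_lines, max_distance=0.3, min_length=5):
    """Pair each reference line with its closest OCR line on one page.

    Returns (tcp_line, ocr_index, distance) tuples. Both thresholds are
    hypothetical: very short lines and weak matches are dropped.
    """
    pairs = []
    for tcp_line in tcp_lines:
        if len(tcp_line) < min_length or not ocr_lines:
            continue
        best = min(
            range(len(ocr_lines)),
            key=lambda k: normalized_distance(tcp_line, ocr_lines[k]),
        )
        distance = normalized_distance(tcp_line, ocr_lines[best])
        if distance <= max_distance:
            pairs.append((tcp_line, best, distance))
    return pairs


# Illustrative page: one body line with two OCR confusions, one short line.
page_tcp = ["Of the Origin of Ideas", "IV"]
page_ocr = ["Of the Orlgin of ldeas", "completely different text here"]
page_pairs = match_lines(page_tcp, page_ocr)
```

In this sketch the short line "IV" is skipped by the length filter, and the remaining reference line is paired with the first OCR line. A page-level mismatch filter like the one described could be approximated by averaging the per-line distances and dropping pages above a chosen cutoff, which is again an assumption about how such a filter might work.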
Provider:
Zenodo
Created:
2026-02-10