Noisy OCR Dataset (NOD)

NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://zenodo.org/record/5068734

Resource description:

This dataset contains 18,568 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).

Source images

The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.

Artificial noise application

The dataset was created as follows:

- First, a binary version of each image was created, so that there were two versions (colour and binary) with no added noise.
- Then six ideal types of image noise ("blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains") were applied to both the colour version and the binary version of each image, creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
- Lastly, all available combinations of two noise filters were applied to the colour and binary images, for an additional 30 versions.

This yielded a total of 44 versions of each image, divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents (322 pages × 44 versions) and an Arabic corpus of 4,400 documents (100 pages × 44 versions).

The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. The original description links to instructions for unzipping .tar.lzma files.

References:

Barcha, Pedro. 2017. “Old Books Dataset.” GitHub Repository. GitHub. https://github.com/PedroBarcha/old-books-dataset.

Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. “Yarmouk Arabic OCR Dataset.” In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.

Hegghammer, Thomas. 2021. “OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment.” SocArXiv. https://osf.io/preprints/socarxiv/6zfvs
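The version counts given under "Artificial noise application" can be checked with a short enumeration. The sketch below is illustrative only and is not the R code shipped in the repository. It assumes that "combinations of two noise filters" means unordered pairs of distinct filters, which is the reading that reproduces the stated 30 two-layer versions. The names `NOISE_TYPES`, `BASES`, and `enumerate_versions` are hypothetical.

```python
from itertools import combinations

# The six noise types and the two noise-free base versions named in the description.
NOISE_TYPES = ["blur", "weak ink", "salt and pepper",
               "watermark", "scribbles", "ink stains"]
BASES = ["colour", "binary"]


def enumerate_versions():
    """List every (base, noise layers) version of a single source image.

    Assumption: two-layer versions are unordered pairs of distinct filters.
    """
    versions = []
    for base in BASES:
        versions.append((base, ()))                 # no added noise
        for noise in NOISE_TYPES:
            versions.append((base, (noise,)))       # one layer of noise
        for pair in combinations(NOISE_TYPES, 2):
            versions.append((base, pair))           # two layers of noise
    return versions


versions = enumerate_versions()

# Count versions by number of noise layers (0, 1, or 2).
by_layers = {n: sum(1 for _, layers in versions if len(layers) == n)
             for n in (0, 1, 2)}

# Corpus sizes from the seed page counts given in the description.
english_total = 322 * len(versions)
arabic_total = 100 * len(versions)

print(len(versions), by_layers)
print(english_total, arabic_total, english_total + arabic_total)
```

Under these assumptions the enumeration gives 44 versions per image (2, 12, and 30 by number of noise layers) and corpus sizes of 14,168 and 4,400 images.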

Created:

2021-07-06
