
CGRE Framework Dataset - A Dataset automatically generated to evaluate OCR Software on Webdocuments

Indexed by NIAID Data Ecosystem on 2026-03-11
Download link:
https://zenodo.org/record/3964398
Resource description:
The provided dataset was generated by the CGRE Framework as part of a bachelor thesis and was used to evaluate the Tesseract OCR software on web documents.

CGRE_dataset.zip:

1. crawl.json
This file contains the crawling results for the alexa.com Top 50 most used webpages in the US, as of 7 June 2020. Only styling information was crawled.

2. html
This directory contains the generated web documents, which are based on the crawled styling information. Each level of the directory tree corresponds to one styling attribute, and each directory is named after the value used for that attribute. Every word is placed in its own span HTML element.

3. dataset
This directory contains the rendered web documents as png files. They were rendered using the Chromium Embedded Framework (CEF). Each rendered web document has a label file in the same directory with the same name, but as a txt file. The label files consist of "word\t(left,top,width,height)\n" lines, where "word" is the word and "(left,top,width,height)" is the bounding box of the span element containing it.

4. dataset_tesseract_complete
This directory contains the Tesseract results on the dataset as txt files. Its structure is analogous to that of the dataset directory, and the txt files use the same "word\t(left,top,width,height)\n" line format.

5. evaluation
This directory contains the results of evaluating Tesseract on the dataset. To evaluate how well Tesseract localises words, the Intersection over Union metric was used with threshold values of 0.5, 0.6, 0.7, 0.8 and 0.9. To evaluate how well Tesseract recognises words, a normalized Levenshtein distance metric was used with the same threshold values. The threshold values are stored in the filenames. The csv files contain the results, with one line per web document. The txt files contain the calculated precision and recall values.

The times were measured on this system: Ubuntu 20.04, AMD Ryzen 5 1600 CPU, AMD Radeon RX Vega 56 GPU, 16 GB DDR4 RAM at 2400 MHz.
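The label format and the two evaluation metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the thesis. The function names are hypothetical, the parser assumes numeric box values with no extra fields, and the normalization of the Levenshtein distance (division by the longer word's length, reported as a similarity) is an assumption, since the description does not state which normalization was used.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, width, height)

# Assumed line shape: word, a tab, then "(left,top,width,height)".
_NUM = r"\s*(-?\d+(?:\.\d+)?)\s*"
LABEL_RE = re.compile(r"^(.*)\t\(" + ",".join([_NUM] * 4) + r"\)\s*$")


def parse_labels(text: str) -> List[Tuple[str, Box]]:
    """Parse label-file content into (word, box) pairs, skipping odd lines."""
    entries = []
    for line in text.splitlines():
        match = LABEL_RE.match(line)
        if match:
            left, top, width, height = (float(g) for g in match.groups()[1:])
            entries.append((match.group(1), (left, top, width, height)))
    return entries


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two (left, top, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def levenshtein(s: str, t: str) -> int:
    """Edit distance (insert, delete, substitute) via row-wise dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def normalized_similarity(s: str, t: str) -> float:
    """1 - distance / longer length (assumed normalization, see lead-in)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest
```

Under these assumptions, a Tesseract word would count as correctly localised at threshold 0.5 when iou(label_box, tesseract_box) >= 0.5, and as correctly recognised when normalized_similarity(label_word, tesseract_word) >= 0.5. How detections are matched to labels is not described in the source and is left open here.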
Created:
2020-08-06