CGRE Framework Dataset - A Dataset automatically generated to evaluate OCR Software on Webdocuments
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/3964398
Resource overview:
Description
The provided dataset was generated by the CGRE Framework.
It was generated as part of a bachelor's thesis and used to evaluate the Tesseract OCR software on webdocuments.
CGRE_dataset.zip:
1. crawl.json
This file contains the crawl results for the alexa.com Top 50 most-used webpages in the US as of 7 June 2020.
Only styling information was crawled.
2. html
The generated webdocuments can be found in this directory.
They are based on the crawled styling information.
Each level of the directory tree corresponds to one styling attribute.
Every directory is named after the value used for that styling attribute.
Every word is placed in an HTML span element.
3. dataset
The rendered webdocuments can be found in this directory as png files.
They were rendered using the Chromium Embedded Framework (CEF) and are accompanied by corresponding labels.
Each label file is in the same directory and has the same name as its rendered webdocument, but is stored as a txt file.
The labels contain "word\t(left,top,width,height)\n" lines.
"(left,top,width,height)" is the bounding box of a span element containing a word.
"word" is the word in the bounding box.
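The label format described above can be read with a few lines of Python. This is a minimal sketch, not part of the CGRE Framework: the names Label, parse_label_line and parse_labels are hypothetical, and it assumes the four box values are plain numbers (parsed here as floats) separated by commas.

```python
import re
from typing import List, NamedTuple


class Label(NamedTuple):
    """One labelled word and the bounding box of its span element."""
    word: str
    left: float
    top: float
    width: float
    height: float


# word, a tab, then "(left,top,width,height)" at the end of the line
_LINE = re.compile(r"^(?P<word>.*)\t\((?P<box>[^()]*)\)$")


def parse_label_line(line: str) -> Label:
    """Parse a single 'word\\t(left,top,width,height)' line."""
    match = _LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a label line: {line!r}")
    values = [float(part) for part in match.group("box").split(",")]
    if len(values) != 4:
        raise ValueError(f"expected 4 box values, got {len(values)}: {line!r}")
    return Label(match.group("word"), *values)


def parse_labels(text: str) -> List[Label]:
    """Parse the full content of a label file, skipping empty lines."""
    return [parse_label_line(line) for line in text.split("\n") if line.strip()]
```

To use it, read a label file with open(path, encoding="utf-8").read() and pass the text to parse_labels. If the format is as described, the same function should also read the files in dataset_tesseract_complete.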
4. dataset_tesseract_complete
This directory contains the Tesseract results on the dataset as txt files.
The structure is analogous to that of the dataset directory.
As in the dataset directory, the txt files contain "word\t(left,top,width,height)\n" lines.
5. evaluation
This directory contains the results of the evaluation of Tesseract on the dataset.
To evaluate the localisation of words by Tesseract, the Intersection Over Union metric was used, with different threshold values (0.5, 0.6, 0.7, 0.8, 0.9).
To evaluate the recognition of words by Tesseract, a normalized Levenshtein distance metric was used, with different threshold values (0.5, 0.6, 0.7, 0.8, 0.9).
The times were measured on the following system:
Ubuntu 20.04, AMD Ryzen 5 1600 CPU, AMD Radeon RX Vega 56 GPU, 16 GB DDR4 RAM at 2400 MHz
The threshold value used is encoded in each filename.
You can find the results in the csv files.
Every line contains the results for a specific webdocument.
The txt files contain calculated precision and recall values.
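The metrics named in this section can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for the thesis. The description does not say how the Levenshtein distance was normalized or how Tesseract boxes were matched to label boxes, so the normalization by the longer word and the greedy one-to-one matching below are assumptions, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two (left, top, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def levenshtein(s: str, t: str) -> int:
    """Edit distance (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def normalized_similarity(s: str, t: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer word."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def precision_recall(truth: Sequence[Box], predicted: Sequence[Box],
                     threshold: float) -> Tuple[float, float]:
    """Assumed matching: each predicted box may claim one unmatched truth box."""
    unmatched: List[Box] = list(truth)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda g: iou(g, box), default=None)
        if best is not None and iou(best, box) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Calling precision_recall once per threshold in (0.5, 0.6, 0.7, 0.8, 0.9) mirrors the threshold sweep described above. For word recognition, the same thresholds could be applied to normalized_similarity instead of iou.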
Created:
2020-08-06



