Kansallisarkisto/AIDA_ocr_training_data

Name: Kansallisarkisto/AIDA_ocr_training_data
Creator: Kansallisarkisto
Published: 2024-12-03 13:45:28
License: MIT

Hugging Face, updated 2024-12-03, indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/Kansallisarkisto/AIDA_ocr_training_data

Resource description:

---
size_categories:
- 100K<n<1M
license: mit
language:
- fi
tags:
- HTR
- OCR
configs:
- config_name: default
  data_files:
  - split: train
    path: "final_rec_data.zip"
---

# OCR training data from AIDA-project

<img src='kuvat/Kuva12.png' width='500'>

### Dataset Summary

The zip file contains textlines and their annotations from the AIDA project. There are about 166k textlines, mainly in Finnish, with some Swedish and English lines and a few French and German lines. The textlines include both typewritten and handwritten lines: roughly 24% of the annotated lines are handwritten and the rest are typewritten. The dataset also contains 120,000 synthetic images.

### Supported Tasks

The dataset was created mainly for the text recognition task.

### Languages

The majority of the textlines are in Finnish, but some are in Swedish and English. In addition, there are a few French and German textlines.

## Dataset structure

### Data Instances

The zip file contains two folders. The folder called text_lines contains all the text lines. The other folder, called annotations, contains the annotations in PaddleOCR format. The annotations are divided into train, validation and test sets. They are further divided into handwritten, typewritten and ship, where ship contains annotations of ship records that are mainly handwritten. The handwritten and typewritten annotations are also divided into "best" and "semi" files. "Best" means that the annotator understood every character in the line, whereas "semi" means that some characters were not understood.

### Data Fields

PaddleOCR format means that the annotations are saved in a txt file with one annotation per line. Each line contains the path to an image, then a separating tab character ("\t"), and then the transcription. An example of the format is shown below.

```
/path/to/0001.jpg\tHello World
/path/to/0002.jpg\tThis is PaddleOCR format.
...
```

### Data Splits

Below is how the annotated data is split. The numbers in parentheses show the number of "semi" textlines.

| Dataset Split | Typewritten | Handwritten | Ship Registry |
| ------------- | ----------- | ----------- | ------------- |
| Train         | 22253 (248) | 6943 (424)  | 3796          |
| Validation    | 4744 (9)    | 1151 (25)   | 469           |
| Test          | 4272 (3)    | 1270 (16)   | 472           |

## Dataset Creation

### Source Data

The data was collected from the Central Archives for Finnish Business (ELKA). It consists of various document types, including letters, ship records and business publications. It includes correspondence between companies, organizations and the public.

### Who are the source language producers?

Given the various types of archival material used in annotation, the scope of producers of the original texts is broad. It includes private individuals and employees of different companies.

### Annotations

The textlines were first cropped out of the original image and then transcribed. If the transcription was unclear, the annotator marked the line as either "somewhat unclear" or "unclear". Unclear images were discarded, but the "somewhat unclear" images are included here in the "semi" annotation files. The rough estimate for the "somewhat unclear" class is that less than 100% and more than 50% of the characters are unclear.

### Who are the annotators?

Annotators were employees of the National Archives of Finland and ELKA.

### Synthetic data

To increase the amount of training data, we created synthetic data using the library https://github.com/Belval/TextRecognitionDataGenerator. We collected Finnish books from https://www.gutenberg.org/ and Finnish magazines from https://archive.org/ and created different kinds of textlines. These include normal textlines, rotated textlines, textlines following a sinusoidal curve, and textlines where characters are subjected to noise.
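
The tab-separated label format described under Data Fields can be read with a few lines of Python. The sketch below is a minimal example and not the project's own tooling: the function name `parse_label_lines` and the sample lines are illustrative, and it assumes that each line holds one image path, one tab, and the transcription.

```python
from typing import Iterable, List, Tuple


def parse_label_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split PaddleOCR-style label lines into (image_path, transcription) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so any later tab stays in the transcription.
        path, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"missing tab separator: {line!r}")
        pairs.append((path, text))
    return pairs


# Hypothetical sample mirroring the example in the dataset card.
sample = [
    "/path/to/0001.jpg\tHello World\n",
    "/path/to/0002.jpg\tThis is PaddleOCR format.\n",
]
pairs = parse_label_lines(sample)
```

To read a real label file, an open file object can be passed in place of `sample`, for instance one opened with `encoding="utf-8"` (the encoding is an assumption; the card does not state it).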
### Personal and Sensitive Information

The dataset is not anonymized, so individuals' names can be found in the dataset.
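
As a quick check on the figures under Data Splits, the table can be tallied in Python. The sketch below copies the counts from the table. It treats the numbers in parentheses as separate "semi" counts, which is an assumption, since the card does not say whether they are already included in the main figures. The names `SPLITS` and `split_totals` are illustrative.

```python
# Counts copied from the Data Splits table as (main count, "semi" count).
# The Ship Registry column lists no "semi" figure, so 0 is used there.
SPLITS = {
    "train": {"typewritten": (22253, 248), "handwritten": (6943, 424), "ship": (3796, 0)},
    "validation": {"typewritten": (4744, 9), "handwritten": (1151, 25), "ship": (469, 0)},
    "test": {"typewritten": (4272, 3), "handwritten": (1270, 16), "ship": (472, 0)},
}


def split_totals(splits):
    """Return {split name: (sum of main counts, sum of "semi" counts)}."""
    return {
        name: (
            sum(main for main, _ in categories.values()),
            sum(semi for _, semi in categories.values()),
        )
        for name, categories in splits.items()
    }


totals = split_totals(SPLITS)
main_total = sum(main for main, _ in totals.values())
semi_total = sum(semi for _, semi in totals.values())

for name, (main, semi) in totals.items():
    print(f"{name}: {main} lines, {semi} semi")
print(f"all splits: {main_total} lines, {semi_total} semi")
```

Under that reading, the annotated lines sum to 45,370 plus 725 "semi" lines. Adding the 120,000 synthetic images gives 166,095, which is close to the figure of about 166k in the summary, although the card does not confirm this breakdown.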

Provider:

Kansallisarkisto
