
caveman273/aida-handwritten

Source: Hugging Face. Updated 2026-04-27. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/caveman273/aida-handwritten
Resource description:
---
license: mit
language:
- fi
- sv
- en
pretty_name: AIDA Handwritten
size_categories:
- 1K<n<10K
task_categories:
- image-to-text
tags:
- HTR
- OCR
- handwritten
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.parquet"
  - split: test
    path: "test.parquet"
  - split: validation
    path: "validation.parquet"
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: file_name
    dtype: string
---

# Handwritten OCR training data from the AIDA project

### Dataset Summary

This dataset contains handwritten textline images and their transcriptions from the AIDA project. It is a subset of the full AIDA dataset, containing only the **best-quality handwritten** annotations: lines where the annotator was confident about every character. The majority of lines are in Finnish, with some Swedish, English, French, and German.

### Supported Tasks

The dataset was created for handwritten text recognition (HTR).

### Languages

The majority of the textlines are in Finnish, but some are in Swedish and English. In addition, there are a few French and German textlines.

## Dataset structure

### Data Instances

Each row contains:

- `image`: the textline image (PNG bytes + filename)
- `text`: the transcription
- `file_name`: the original image filename

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `image` | Image | Textline image |
| `text` | string | Ground-truth transcription |
| `file_name` | string | Original image filename |

### Data Splits

This dataset contains only the "best" handwritten annotations (every character understood by the annotator). The additional "semi" lines (some characters unclear) are not included here.

| Dataset Split | Handwritten |
| ------------- | ----------- |
| Train | 6943 |
| Validation | 1151 |
| Test | 1270 |

## Dataset Creation

### Source Data

The data was collected from the Central Archives for Finnish Business (ELKA). It consists of various document types, including letters, ship records, and business publications. It includes correspondence between companies, organizations and the public.

### Who are the source language producers?

Given the various types of archival material used in annotation, the scope of producers of the original texts is broad. It includes private individuals and employees of different companies.

### Annotations

The textlines were first cropped out of the original image and then transcribed. If the transcription was unclear, the annotator marked it as either "somewhat unclear" or "unclear". Unclear images were discarded, while the "somewhat unclear" images make up the "semi" annotation files, which are not part of this subset. As a rough estimate, in the "somewhat unclear" class less than 100% and more than 50% of the characters are unclear.

### Who are the annotators?

Annotators were employees of the National Archives of Finland and ELKA.

### Synthetic data

To increase the amount of training data, we created synthetic data using the TextRecognitionDataGenerator library (https://github.com/Belval/TextRecognitionDataGenerator). We collected Finnish books from https://www.gutenberg.org/ and Finnish magazines from https://archive.org/ and created different kinds of textlines: normal textlines, rotated textlines, textlines following a sinusoidal curve, and textlines where characters are subjected to noise.

### Personal and Sensitive Information

The dataset is not anonymized, so individuals' names can be found in the dataset.
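As a rough illustration of how the split table and field list above might be used, the sketch below computes each split's share of the data and, optionally, compares the documented line counts with what is actually downloaded. It assumes the Hugging Face `datasets` package is installed and that the repository id `caveman273/aida-handwritten` from the download link resolves on the Hub. The helper names `split_fractions` and `load_and_check` are hypothetical and not part of the dataset or any library.

```python
"""Sketch: inspect the documented split sizes and optionally verify them."""

# Line counts copied from the "Data Splits" table in the card.
EXPECTED_ROWS = {"train": 6943, "validation": 1151, "test": 1270}


def split_fractions(rows):
    """Return each split's share of the total number of lines."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


def load_and_check(repo_id="caveman273/aida-handwritten"):
    """Download the dataset and compare its split sizes with the card.

    Assumption: the `datasets` package is installed and the Hub is
    reachable. The import is lazy so the helper above works offline.
    """
    from datasets import load_dataset

    ds = load_dataset(repo_id)  # DatasetDict keyed by split name
    actual = {name: split.num_rows for name, split in ds.items()}
    # Map each mismatching split to (documented, downloaded) row counts.
    mismatches = {
        name: (expected, actual.get(name))
        for name, expected in EXPECTED_ROWS.items()
        if actual.get(name) != expected
    }
    return ds, mismatches


if __name__ == "__main__":
    for name, share in split_fractions(EXPECTED_ROWS).items():
        print(f"{name}: {EXPECTED_ROWS[name]} lines ({share:.1%})")
```

The three splits total 9364 lines, which is consistent with the `1K<n<10K` size category. After a successful `load_and_check()`, a row such as `ds["train"][0]` should expose the `image`, `text` and `file_name` fields described under Data Fields.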
Provider:
caveman273