caveman273/aida-handwritten
Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/caveman273/aida-handwritten
Resource description:
---
license: mit
language:
- fi
- sv
- en
pretty_name: AIDA Handwritten
size_categories:
- 1K<n<10K
task_categories:
- image-to-text
tags:
- HTR
- OCR
- handwritten
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.parquet"
  - split: test
    path: "test.parquet"
  - split: validation
    path: "validation.parquet"
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: file_name
    dtype: string
---
# Handwritten OCR training data from AIDA-project
### Dataset Summary
This dataset contains handwritten textline images and their transcriptions from the AIDA project. It is a subset of the full AIDA dataset and contains only the **best-quality handwritten** annotations, that is, lines where the annotator was confident about every character. The majority of lines are in Finnish, with some Swedish, English, French, and German.
### Supported Tasks
The dataset was created for handwritten text recognition (HTR).
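HTR output is commonly scored with the character error rate (CER): the edit distance between the predicted and reference transcription, divided by the reference length. The card does not say which metric the authors used, so the following is only a minimal sketch of that common choice; the function names `edit_distance` and `cer` are illustrative and not part of the dataset.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0
```

For example, `cer(["Helsinki"], ["Helsinkl"])` gives 0.125: one substitution over eight reference characters.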
### Languages
The majority of the textlines are in Finnish, but some are in Swedish and English. In addition, there are a few French and German textlines.
## Dataset structure
### Data Instances
Each row contains:
- `image`: the textline image (PNG bytes + filename)
- `text`: the transcription
- `file_name`: the original image filename
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `image` | Image | Textline image |
| `text` | string | Ground-truth transcription |
| `file_name` | string | Original image filename |
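A minimal sketch of reading these fields with the Hugging Face `datasets` library, assuming the repository id `caveman273/aida-handwritten` taken from the download link and that the `image` column decodes to a PIL image. The helper names `load_aida_handwritten`, `summarize_row`, and `demo` are hypothetical.

```python
def load_aida_handwritten(repo_id="caveman273/aida-handwritten"):
    """Load all splits from the Hub (needs network access and the `datasets` package)."""
    from datasets import load_dataset  # imported lazily so summarize_row works without it
    return load_dataset(repo_id)


def summarize_row(row):
    """Describe one row using the documented field names."""
    text = row["text"]
    return f'{row["file_name"]}: {len(text)} characters, text={text!r}'


def demo():
    """Print the split overview and the first training row."""
    ds = load_aida_handwritten()
    print(ds)                    # split names and row counts
    first = ds["train"][0]
    print(summarize_row(first))
    print(first["image"].size)   # (width, height), assuming a decoded PIL image
```

Calling `demo()` downloads the Parquet files on first use; `summarize_row` itself works on any dict with `text` and `file_name` keys.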
### Data Splits
This dataset contains only the "best" handwritten annotations (every character understood by the annotator). The additional "semi" lines (some characters unclear) are not included here, so the table lists a single count per split.
| Dataset Split | Handwritten |
| ------------- | ----------- |
| Train | 6943 |
| Validation | 1151 |
| Test | 1270 |
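The split sizes can be checked with a few lines of arithmetic. The sketch below uses only the counts from the table; the total it computes, 9364 lines, is consistent with the `1K<n<10K` size category in the metadata.

```python
# Line counts copied from the Data Splits table.
splits = {"train": 6943, "validation": 1151, "test": 1270}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:<10} {count:>5} lines  {shares[name]:.1%}")
print(f"{'total':<10} {total:>5} lines")
```

This works out to roughly 74% train, 12% validation, and 14% test.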
## Dataset Creation
### Source Data
The data was collected from the Central Archives for Finnish Business (ELKA). It consists of various document types, including letters, ship records, and business publications, and it includes correspondence between companies, organizations, and the public.
### Who are the source language producers?
Given the various types of archival material used in annotation, the scope of producers of the original texts is broad. It includes private individuals and employees of different companies.
### Annotations
The textlines were first cropped out of the original images and then transcribed. If a transcription was uncertain, the annotator marked the line as either "somewhat unclear" or "unclear". Unclear lines were discarded, while "somewhat unclear" lines were kept in the separate "semi" annotation files, which are not included in this subset. As a rough estimate, the "somewhat unclear" class covers lines where less than 100% and more than 50% of the characters are unclear.
### Who are the annotators?
The annotators were employees of the National Archives of Finland and ELKA.
### Synthetic data
To increase the amount of training data, we created synthetic data with the TextRecognitionDataGenerator library (https://github.com/Belval/TextRecognitionDataGenerator). We collected Finnish books from https://www.gutenberg.org/ and Finnish magazines from https://archive.org/ and rendered several kinds of textlines: normal textlines, rotated textlines, textlines following a sinusoidal curve, and textlines whose characters are subjected to noise.
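To illustrate what "following a sinusoidal curve" means geometrically, the sketch below computes a vertical shift for each pixel column of a line image. It is not how TextRecognitionDataGenerator implements its distortion, and the function name and the `amplitude` and `period` values are assumptions chosen for illustration.

```python
import math


def sinusoidal_offsets(width, amplitude=4.0, period=120.0):
    """Vertical shift in pixels for each column x of a line image `width` pixels wide."""
    return [amplitude * math.sin(2 * math.pi * x / period) for x in range(width)]


# One shift per pixel column of a 240-pixel-wide line (two full periods).
offsets = sinusoidal_offsets(240)
```

Shifting each column of a rendered line by its offset bends the baseline into a wave with a peak displacement of `amplitude` pixels.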
### Personal and Sensitive Information
The dataset is not anonymized, so individuals' names can be found in the dataset.
Provider:
caveman273



