Rahvusarhiiv/et_handwriting_complete
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Rahvusarhiiv/et_handwriting_complete
Resource description:
---
task_categories:
- image-to-text
language:
- et
license: cc0-1.0
tags:
- Estonia
- historical-documents
- page-xml
- alto-xml
- transkribus
- ocr
- layout-analysis
- document-structure
pretty_name: Estonian Handwriting Full XML
size_categories:
- 1K<n<10K
---
# Dataset of Full PAGE XML and ALTO Annotations in Handwritten Estonian Documents
## Dataset Description
This dataset contains full page-level Transkribus exports from Estonian historical documents. Each example pairs a full page image with the corresponding PAGE XML and ALTO XML for the same page, preserving document structure, layout coordinates, reading order, baselines, and text content where available.
The dataset is intended for OCR research, layout analysis, document structure modelling, XML parsing, and conversion benchmarking between document analysis formats.
## 📊 Dataset Summary
- **Total Examples**: 1,700 pages
- **Language**: 🇪🇪 Estonian
- **Dataset Size**: ~2.1 GB
- **Task**: OCR, Layout Analysis, Document Structure Analysis, XML Parsing
- **Domain**: Historical Documents, Archival Materials
## 🗂️ Dataset Structure
### 📋 Features
- **image**: Full page image (PIL Image)
- **page**: Full PAGE XML as a UTF-8 string
- **alto**: Full ALTO XML as a UTF-8 string
- **document_title**: Source document name
- **AIS_reference**: [AIS](https://ais.ra.ee/en) file reference number
- **page_number**: Position of the page (frame) within the source file
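
As a minimal sketch of how these fields might be accessed, the snippet below streams examples with the Hugging Face `datasets` library and checks that each record carries the fields listed above. The split name `"train"` is an assumption (the card does not state it), and `missing_fields` and `iter_examples` are hypothetical helper names used only for illustration.

```python
from typing import Any, Iterator, Mapping

REPO_ID = "Rahvusarhiiv/et_handwriting_complete"

# Field names as listed in the Features section of this card.
EXPECTED_FIELDS = (
    "image",
    "page",
    "alto",
    "document_title",
    "AIS_reference",
    "page_number",
)


def missing_fields(example: Mapping[str, Any]) -> list[str]:
    """Return the documented field names that are absent from one example."""
    return [name for name in EXPECTED_FIELDS if name not in example]


def iter_examples(split: str = "train") -> Iterator[Mapping[str, Any]]:
    """Stream examples from the Hub instead of downloading everything first.

    Assumptions: the `datasets` library is installed and the split is named
    "train"; neither is confirmed by the dataset card.
    """
    from datasets import load_dataset  # imported lazily, optional dependency

    yield from load_dataset(REPO_ID, split=split, streaming=True)
```

Calling `next(iter_examples())` should yield one record whose `page` and `alto` values are XML strings, and `missing_fields` on that record can confirm the schema before any further processing.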
### 🎯 XML Formats
The dataset stores the original XML content as readable strings.
**PAGE XML** contains page metadata, reading order, text regions, text lines, coordinates, baselines, and text content:
```xml
<PcGts ...>
  <Page imageFilename="..." imageWidth="5472" imageHeight="3648">
    <ReadingOrder>...</ReadingOrder>
    <TextRegion ...>
      <Coords points="..."/>
      <TextLine ...>
        <Coords points="..."/>
        <Baseline points="..."/>
        <TextEquiv>
          <Unicode>...</Unicode>
        </TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
```
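
The PAGE skeleton above could be read with Python's standard library along the following lines. This is a sketch, not the maintainers' tooling: it matches elements by local name so that it does not depend on a particular PAGE namespace version, and it returns lines in document order rather than following `ReadingOrder`. The names `local_name` and `page_text_lines` are illustrative.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def page_text_lines(page_xml: str) -> list[str]:
    """Return the transcription of every TextLine, in document order.

    Lines without a transcription are returned as empty strings.
    """
    root = ET.fromstring(page_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        text = ""
        # Only inspect TextEquiv directly under the line; nested Word
        # elements, if present, carry their own TextEquiv.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                for grandchild in child:
                    if local_name(grandchild.tag) == "Unicode":
                        text = grandchild.text or ""
        lines.append(text)
    return lines
```

Passing the `page` field of one example to `page_text_lines` gives a plain list of line transcriptions, which is often enough for OCR evaluation.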
**ALTO XML** contains layout blocks, text lines, token-level strings, and page-level image metadata:
```xml
<alto ...>
  <Layout>
    <Page WIDTH="5472" HEIGHT="3648">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="..."/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
```
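
For the ALTO side, a comparable sketch rebuilds each line by joining the `CONTENT` attributes of its `String` tokens. Joining with a single space is an assumption about how the tokens should be rendered; the helper names are again illustrative.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml: str) -> list[str]:
    """Return one string per ALTO TextLine, built from its String tokens."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Collect token text from direct String children only; other
        # children such as spacing elements are ignored.
        tokens = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(tokens))
    return lines
```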
## 🔧 Technical Details
### XML Standards
- PAGE XML follows the PAGE schema used by Transkribus exports
- ALTO XML follows the ALTO schema for OCR and layout representation
- Both XML fields are stored as full strings and can be parsed with standard XML tooling
### Coordinate System
- Coordinates are expressed in pixel space relative to the full page image
- The origin `(0,0)` is at the top-left corner
- PAGE XML includes polygon coordinates and baselines
- ALTO XML includes page and layout geometry in the same page coordinate space
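
To illustrate the coordinate conventions above, the sketch below turns a PAGE `points` attribute (space-separated `x,y` pairs) into a list of pixel coordinates and derives an axis-aligned bounding box. The function names are hypothetical, and truncating decimal values to integers is a simplifying assumption.

```python
def parse_points(points: str) -> list[tuple[int, int]]:
    """Parse 'x1,y1 x2,y2 ...' into [(x1, y1), (x2, y2), ...].

    Values written as decimals are accepted and truncated toward zero.
    """
    pairs = []
    for token in points.split():
        x, y = token.split(",")
        pairs.append((int(float(x)), int(float(y))))
    return pairs


def bounding_box(points: list[tuple[int, int]]) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) enclosing all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)
```

Because the origin is the top-left corner, the resulting tuple is already in the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so a line or region can be cut out of the `image` field directly.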
### Data Processing
- Extracted from Transkribus exports
- Full XML content preserved instead of flattening annotations into simpler fields
- Images are stored alongside their corresponding XML representations
- Document metadata is retained through `document_title`, `AIS_reference`, and `page_number`
## ⚠️ Data Quality Notes
- Image quality varies depending on the preservation state of the historical documents
- XML quality depends on the original Transkribus annotations and export pipeline
- Some pages may contain irregular layouts, overlapping regions, or complex handwriting
- PAGE XML and ALTO XML represent the same page but may differ in structure and granularity
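
Since the two formats can diverge, a quick sanity check before training or benchmarking may be useful. The sketch below compares the page dimensions and the number of `TextLine` elements in the PAGE and ALTO strings of one example. It assumes the attribute names shown in the XML Formats section (`imageWidth`/`imageHeight` and `WIDTH`/`HEIGHT`); the function names are illustrative.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def page_size(root, width_attr: str, height_attr: str):
    """Return (width, height) of the first Page element, or None if absent."""
    page = next((e for e in root.iter() if local_name(e.tag) == "Page"), None)
    if page is None:
        return None
    width, height = page.get(width_attr), page.get(height_attr)
    if width is None or height is None:
        return None
    return float(width), float(height)


def count_lines(root) -> int:
    """Count TextLine elements anywhere in the tree."""
    return sum(1 for e in root.iter() if local_name(e.tag) == "TextLine")


def compare_page_and_alto(page_xml: str, alto_xml: str) -> dict:
    """Summarise basic agreement between the two annotations of one page."""
    page_root = ET.fromstring(page_xml)
    alto_root = ET.fromstring(alto_xml)
    size_in_page = page_size(page_root, "imageWidth", "imageHeight")
    size_in_alto = page_size(alto_root, "WIDTH", "HEIGHT")
    return {
        "same_size": size_in_page is not None and size_in_page == size_in_alto,
        "page_lines": count_lines(page_root),
        "alto_lines": count_lines(alto_root),
    }
```

A mismatch in line counts is not necessarily an error, because the formats may differ in granularity, but it flags pages worth inspecting by hand.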
## 🚫 Limitations
- Limited to Estonian-language historical documents
- Data is stored as raw XML strings, so downstream use typically requires XML parsing
- Historical handwriting and scan quality may affect OCR and layout consistency
- Some pages may contain incomplete, noisy, or inconsistent annotations
## 📞 Contact
Depending on the nature of your question, please contact one of the following:
- Content of dataset: [@svlp](https://huggingface.co/svlp) or [@LudwigRoine](https://huggingface.co/LudwigRoine)
- Format or anything technical: [@paulpall](https://huggingface.co/paulpall)
- For everything else: [the National Archives of Estonia](https://www.ra.ee/en/kontakt/).
***

Provider:
Rahvusarhiiv



