
Rahvusarhiiv/et_handwriting_complete

Source: Hugging Face. Updated 2026-03-24. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Rahvusarhiiv/et_handwriting_complete
Resource description:
---
task_categories:
- image-to-text
language:
- et
license: cc0-1.0
tags:
- Estonia
- historical-documents
- page-xml
- alto-xml
- transkribus
- ocr
- layout-analysis
- document-structure
pretty_name: Estonian Handwriting Full XML
size_categories:
- 1K<n<10K
---

# Dataset of Full PAGE XML and ALTO Annotations in Handwritten Estonian Documents

## Dataset Description

This dataset contains full page-level Transkribus exports from Estonian historical documents. Each example pairs a full page image with the corresponding PAGE XML and ALTO XML for the same page, preserving document structure, layout coordinates, reading order, baselines, and text content where available.

The dataset is intended for OCR research, layout analysis, document structure modelling, XML parsing, and conversion benchmarking between document analysis formats.

## 📊 Dataset Summary

- **Total Examples**: 1,700 pages
- **Language**: 🇪🇪 Estonian
- **Dataset Size**: ~2.1 GB
- **Task**: OCR, Layout Analysis, Document Structure Analysis, XML Parsing
- **Domain**: Historical Documents, Archival Materials

## 🗂️ Dataset Structure

### 📋 Features

- **image**: Full page image (PIL Image)
- **page**: Full PAGE XML as a UTF-8 string
- **alto**: Full ALTO XML as a UTF-8 string
- **document_title**: Source document name
- **AIS_reference**: [AIS](https://ais.ra.ee/en) file reference number
- **page_number**: Location of frame in file

### 🎯 XML Formats

The dataset stores the original XML content as readable strings.

**PAGE XML** contains page metadata, reading order, text regions, text lines, coordinates, baselines, and text content:

```xml
<PcGts ...>
  <Page imageFilename="..." imageWidth="5472" imageHeight="3648">
    <ReadingOrder>...</ReadingOrder>
    <TextRegion ...>
      <Coords points="..."/>
      <TextLine ...>
        <Coords points="..."/>
        <Baseline points="..."/>
        <TextEquiv>
          <Unicode>...</Unicode>
        </TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
```

**ALTO XML** contains layout blocks, text lines, token-level strings, and page-level image metadata:

```xml
<alto ...>
  <Layout>
    <Page WIDTH="5472" HEIGHT="3648">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="..."/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
```

## 🔧 Technical Details

### XML Standards

- PAGE XML follows the PAGE schema used by Transkribus exports
- ALTO XML follows the ALTO schema for OCR and layout representation
- Both XML fields are stored as full strings and can be parsed with standard XML tooling

### Coordinate System

- Coordinates are expressed in pixel space relative to the full page image
- The origin `(0,0)` is at the top-left corner
- PAGE XML includes polygon coordinates and baselines
- ALTO XML includes page and layout geometry in the same page coordinate space

### Data Processing

- Extracted from Transkribus exports
- Full XML content preserved instead of flattening annotations into simpler fields
- Images are stored alongside their corresponding XML representations
- Document metadata is retained through `document_title`, `AIS_reference`, and `page_number`

## ⚠️ Data Quality Notes

- Image quality varies depending on the preservation state of the historical documents
- XML quality depends on the original Transkribus annotations and export pipeline
- Some pages may contain irregular layouts, overlapping regions, or complex handwriting
- PAGE XML and ALTO XML represent the same page but may differ in structure and granularity

## 🚫 Limitations

- Limited to Estonian language historical documents
- Data is stored as raw XML strings, so downstream use typically requires XML parsing
- Historical handwriting and scan quality may affect OCR and layout consistency
- Some pages may contain incomplete, noisy, or inconsistent annotations

## 📞 Contact

Depending on the nature of your question, please contact one of the following:

- Content of dataset: [@svlp](https://huggingface.co/svlp) or [@LudwigRoine](https://huggingface.co/LudwigRoine)
- Format or anything technical: [@paulpall](https://huggingface.co/paulpall)
- For everything else: [the National Archives of Estonia](https://www.ra.ee/en/kontakt/).

***

![Interreg Central Baltic Programme Co-funded by the European Union ArchXAI](ArchXAI_RGB_PNG.png)
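
Because the `page` and `alto` fields are raw XML strings, downstream use usually starts with parsing them. The following is a minimal sketch of that step, not part of the original card, using only Python's standard `xml.etree.ElementTree`. The helper names (`local`, `parse_points`, `page_lines`, `alto_lines`) and the sample XML, including its text and coordinates, are invented for illustration. Elements are matched by local tag name, so the sketch should not depend on a particular PAGE or ALTO schema version. It assumes that `points` attributes hold space-separated integer `x,y` pairs, and that joining ALTO `String` tokens with single spaces is an acceptable approximation of the line text.

```python
import xml.etree.ElementTree as ET

# Invented sample data that mirrors the element structure described in the card.
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="5472" imageHeight="3648">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1l1">
        <Coords points="110,120 880,120 880,190 110,190"/>
        <Baseline points="110,180 880,182"/>
        <TextEquiv><Unicode>Tere tulemast</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page WIDTH="5472" HEIGHT="3648">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Tere"/>
            <SP/>
            <String CONTENT="tulemast"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local(tag):
    """Strip the XML namespace: '{uri}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) pixel tuples.

    Assumption: coordinates are integers, origin at the top-left corner.
    """
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def page_lines(page_xml):
    """Collect text, polygon and baseline for every TextLine in a PAGE XML string."""
    root = ET.fromstring(page_xml)
    lines = []
    for el in root.iter():
        if local(el.tag) != "TextLine":
            continue
        line = {"text": "", "coords": [], "baseline": []}
        # Only direct children are read, so region-level Coords are ignored.
        for child in el:
            name = local(child.tag)
            if name == "Coords":
                line["coords"] = parse_points(child.get("points", ""))
            elif name == "Baseline":
                line["baseline"] = parse_points(child.get("points", ""))
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        line["text"] = sub.text or ""
        lines.append(line)
    return lines


def alto_lines(alto_xml):
    """Join the String tokens of each ALTO TextLine into one text line."""
    root = ET.fromstring(alto_xml)
    lines = []
    for el in root.iter():
        if local(el.tag) == "TextLine":
            tokens = [s.get("CONTENT", "") for s in el if local(s.tag) == "String"]
            lines.append(" ".join(tokens))
    return lines


if __name__ == "__main__":
    for line in page_lines(SAMPLE_PAGE):
        print(line["text"], line["baseline"])
    print(alto_lines(SAMPLE_ALTO))
```

The same helpers could be applied to a dataset record, for example `page_lines(example["page"])` and `alto_lines(example["alto"])`. Comparing the two outputs per page is one simple way to check how the PAGE and ALTO exports differ in granularity.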
Provider:
Rahvusarhiiv