Finnish Court Records-sub500. A dataset of Finnish notarial records (19th Century)
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/3945087
Description:
This dataset is a selection of 500 pages from the Renovated District Court Records (19th century), one of the largest collections in the National Archives of Finland. The documents consist of records of deeds, mortgages, and traditional life annuities, among others.
This dataset contains images that each show one or two document pages. It is annotated at image level with six region types, along with baselines and line-level transcriptions (Swedish). This mix of single-page and double-page images is a common complication in historical documents.
Layout labels are:
1. page-number: the page number, commonly placed in the top-right corner of the image.
2. paragraph: a paragraph placed on a single-page image or on the left side of a double-page image.
3. paragraph_2: a paragraph placed on the right side of a double-page image.
4. marginalia: any annotation in the margin of the document.
5. table: a table placed on a single-page image or on the left side of a double-page image.
6. table_2: a table placed on the right side of a double-page image.
The images, along with their respective ground truth, were compiled in PAGE-compliant XML format by the National Archives of Finland and the HTR group of the Pattern Recognition and Human Language Technologies Research Center.
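As a rough illustration of how the annotations described above might be read, the following Python sketch parses a PAGE-style XML string and collects each region's layout label, baselines, and line transcriptions. Several things here are assumptions rather than facts about this dataset: the `SAMPLE` fragment is hand-written and hypothetical, the layout label is assumed to be stored in the region's `custom` attribute as `structure {type:...;}`, and the helper names `parse_page` and `page_side` are invented for this example. The actual files may encode labels differently, so check a real file before relying on this.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, hand-written PAGE-style fragment (not taken from the dataset).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="2000" imageHeight="1500">
    <TextRegion id="r1" custom="structure {type:page-number;}">
      <Coords points="1800,50 1900,50 1900,100 1800,100"/>
    </TextRegion>
    <TextRegion id="r2" custom="structure {type:paragraph;}">
      <Coords points="100,200 900,200 900,400 100,400"/>
      <TextLine id="r2l1">
        <Baseline points="100,300 900,300"/>
        <TextEquiv><Unicode>Exempel rad</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r3" custom="structure {type:paragraph_2;}">
      <Coords points="1100,200 1900,200 1900,400 1100,400"/>
      <TextLine id="r3l1">
        <Baseline points="1100,300 1900,300"/>
        <TextEquiv><Unicode>Andra sidan</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

# Assumption: the label sits in custom="structure {type:<label>;}".
TYPE_RE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")


def local(tag):
    """Drop the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def page_side(label):
    """Labels ending in '_2' mark the right page of a double-page image."""
    return "right" if label and label.endswith("_2") else "single-or-left"


def parse_page(xml_text):
    """Return one dict per region with its label, page side, and text lines."""
    root = ET.fromstring(xml_text)
    regions = []
    for elem in root.iter():
        if not local(elem.tag).endswith("Region"):
            continue
        match = TYPE_RE.search(elem.get("custom", ""))
        label = match.group(1).strip() if match else None
        lines = []
        # Note: lines inside nested regions would be reported for both parents.
        for line in elem.iter():
            if local(line.tag) != "TextLine":
                continue
            baseline, text = [], ""
            for child in line:
                name = local(child.tag)
                if name == "Baseline":
                    baseline = [tuple(int(v) for v in pt.split(","))
                                for pt in child.get("points", "").split()]
                elif name == "TextEquiv":
                    for sub in child:
                        if local(sub.tag) == "Unicode":
                            text = sub.text or ""
            lines.append({"baseline": baseline, "text": text})
        regions.append({"id": elem.get("id"), "label": label,
                        "side": page_side(label), "lines": lines})
    return regions


if __name__ == "__main__":
    parsed = parse_page(SAMPLE)
    print(Counter(r["label"] for r in parsed))
    for region in parsed:
        for line in region["lines"]:
            print(region["label"], region["side"], line["text"])
```

Matching on the local tag name rather than a fixed namespace keeps the sketch independent of which PAGE schema version a file declares. Counting labels with `Counter` over all files is one simple way to check that only the six region types listed above occur.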
Created:
2020-07-15



