biglam/yalta_ai_segmonto_manuscript_dataset

Name: biglam/yalta_ai_segmonto_manuscript_dataset
Creator: biglam
Published: 2022-08-12 08:33:43
License: no description provided

Source: Hugging Face. Updated: 2022-08-12. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/biglam/yalta_ai_segmonto_manuscript_dataset

Resource description:

---
annotations_creators:
- expert-generated
language: []
language_creators:
- expert-generated
license:
- cc-by-4.0
multilinguality: []
pretty_name: YALTAi Tabular Dataset
size_categories:
- n<1K
source_datasets: []
tags:
- manuscripts
- LAM
task_categories:
- object-detection
task_ids: []
---

# YALTAi Segmonto Manuscript and Early Printed Book Dataset

## Table of Contents

- [YALTAi Segmonto Manuscript and Early Printed Book Dataset](#yaltai-segmonto-manuscript-and-early-printed-book-dataset)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://doi.org/10.5281/zenodo.6814770](https://doi.org/10.5281/zenodo.6814770)
- **Paper:** [https://arxiv.org/abs/2207.11230](https://arxiv.org/abs/2207.11230)

### Dataset Summary

This dataset contains a subset of data used in the paper [You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine](https://arxiv.org/abs/2207.11230). This paper proposes treating page layout recognition on historical documents as an object detection task (compared to the usual pixel segmentation approach). This dataset contains images from digitised manuscripts and early printed books with the following labels:

- DamageZone
- DigitizationArtefactZone
- DropCapitalZone
- GraphicZone
- MainZone
- MarginTextZone
- MusicZone
- NumberingZone
- QuireMarksZone
- RunningTitleZone
- SealZone
- StampZone
- TableZone
- TitlePageZone

### Supported Tasks and Leaderboards

- `object-detection`: This dataset can be used to train a model for object-detection on historic document images.

## Dataset Structure

This dataset has two configurations. These configurations both cover the same data and annotations but provide these annotations in different forms to make it easier to integrate the data with existing processing pipelines.

- The first configuration, `YOLO`, uses the data's original format.
- The second configuration converts the YOLO format into a format closer to the `COCO` annotation format. This is done to make it easier to work with the `feature_extractor` from the `Transformers` models for object detection, which expect data to be in a COCO style format.

### Data Instances

An example instance from the COCO config:

```python
{'height': 5610,
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=3782x5610 at 0x7F3B785609D0>,
 'image_id': 0,
 'objects': [{'area': 203660,
              'bbox': [1545.0, 207.0, 1198.0, 170.0],
              'category_id': 9,
              'id': 0,
              'image_id': '0',
              'iscrowd': False,
              'segmentation': []},
             {'area': 137034,
              'bbox': [912.0, 1296.0, 414.0, 331.0],
              'category_id': 2,
              'id': 0,
              'image_id': '0',
              'iscrowd': False,
              'segmentation': []},
             {'area': 110865,
              'bbox': [2324.0, 908.0, 389.0, 285.0],
              'category_id': 2,
              'id': 0,
              'image_id': '0',
              'iscrowd': False,
              'segmentation': []},
             {'area': 281634,
              'bbox': [2308.0, 3507.0, 438.0, 643.0],
              'category_id': 2,
              'id': 0,
              'image_id': '0',
              'iscrowd': False,
              'segmentation': []},
             {'area': 5064268,
              'bbox': [949.0, 471.0, 1286.0, 3938.0],
              'category_id': 4,
              'id': 0,
              'image_id': '0',
              'iscrowd': False,
              'segmentation': []},
             {'area': 5095104,
              'bbox': [2303.0, 539.0, 1338.0, 3808.0],
              'category_id': 4,
              'id': 0,
              'image_id': '0',
              'iscrowd': False,
              'segmentation': []}],
 'width': 3782}
```

An example instance from the YOLO config:

```python
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=3782x5610 at 0x7F3B785EFA90>,
 'objects': {'bbox': [[2144, 292, 1198, 170],
                      [1120, 1462, 414, 331],
                      [2519, 1050, 389, 285],
                      [2527, 3828, 438, 643],
                      [1593, 2441, 1286, 3938],
                      [2972, 2444, 1338, 3808]],
             'label': [9, 2, 2, 2, 4, 4]}}
```

### Data Fields

The fields for the YOLO config:

- `image`: the image
- `objects`: the annotations, which consist of:
  - `bbox`: a list of bounding boxes for the image
  - `label`: a list of labels for this image

The fields for the COCO config:

- `height`: height of the image
- `width`: width of the image
- `image`: the image
- `image_id`: id for the image
- `objects`: annotations in COCO format, consisting of a list of dictionaries with the following keys:
  - `bbox`: bounding box of the annotated object
  - `category_id`: label of the annotated object
  - `image_id`: id for the image
  - `iscrowd`: the COCO `iscrowd` flag
  - `segmentation`: COCO segmentation annotations (empty in this case but kept for compatibility with other processing scripts)

### Data Splits

The dataset contains a train, validation and test split with the following numbers per split:

| Dataset | Number of images |
|---------|------------------|
| Train   | 854              |
| Dev     | 154              |
| Test    | 139              |

A more detailed summary of the dataset (copied from the paper):

| Zone                     | Train | Dev | Test | Total | Average area | Median area |
|--------------------------|------:|----:|-----:|------:|-------------:|------------:|
| DropCapitalZone          |  1537 | 180 |  222 |  1939 |         0.45 |        0.26 |
| MainZone                 |  1408 | 253 |  258 |  1919 |        28.86 |       26.43 |
| NumberingZone            |   421 |  57 |   76 |   554 |         0.18 |        0.14 |
| MarginTextZone           |   396 |  59 |   49 |   504 |         1.19 |        0.52 |
| GraphicZone              |   289 |  54 |   50 |   393 |         8.56 |        4.31 |
| MusicZone                |   237 |  71 |    0 |   308 |         1.22 |        1.09 |
| RunningTitleZone         |   137 |  25 |   18 |   180 |         0.95 |        0.84 |
| QuireMarksZone           |    65 |  18 |    9 |    92 |         0.25 |        0.21 |
| StampZone                |    85 |   5 |    1 |    91 |         1.69 |        1.14 |
| DigitizationArtefactZone |     1 |   0 |   32 |    33 |         2.89 |        2.79 |
| DamageZone               |     6 |   1 |   14 |    21 |         1.50 |        0.02 |
| TitlePageZone            |     4 |   0 |    1 |     5 |        48.27 |       63.39 |

## Dataset Creation

This dataset is derived from:

- CREMMA Medieval (Pinche, A. (2022). Cremma Medieval (Version Bicerin 1.1.0) [Data set](https://github.com/HTR-United/cremma-medieval))
- CREMMA Medieval Lat (Clérice, T. and Vlachou-Efstathiou, M. (2022). Cremma Medieval Latin [Data set](https://github.com/HTR-United/cremma-medieval-lat))
- Eutyches (Vlachou-Efstathiou, M. Voss.Lat.O.41 - Eutyches "de uerbo" glossed [Data set](https://github.com/malamatenia/Eutyches))
- Gallicorpora HTR-Incunable-15e-Siecle (Pinche, A., Gabay, S., Leroy, N., & Christensen, K. Données HTR incunable du 15e siècle [Computer software](https://github.com/Gallicorpora/HTR-incunable-15e-siecle))
- Gallicorpora HTR-MSS-15e-Siecle (Pinche, A., Gabay, S., Leroy, N., & Christensen, K. Données HTR manuscrits du 15e siècle [Computer software](https://github.com/Gallicorpora/HTR-MSS-15e-Siecle))
- Gallicorpora HTR-imprime-gothique-16e-siecle (Pinche, A., Gabay, S., Vlachou-Efstathiou, M., & Christensen, K. HTR-imprime-gothique-16e-siecle [Computer software](https://github.com/Gallicorpora/HTR-imprime-gothique-16e-siecle))

In addition, it includes a few hundred newly annotated images, in particular the test set, which is entirely new and based on early prints and manuscripts. These additional annotations were created by correcting the output of an early version of the model developed in the paper, using the [roboflow](https://roboflow.com/) platform.

### Curation Rationale

[More information needed]

### Source Data

The sources of the data are described above.

#### Initial Data Collection and Normalization

[More information needed]

#### Who are the source language producers?

[More information needed]

### Annotations

#### Annotation process

Additional annotations produced for this dataset were created by correcting the output of an early version of the model developed in the paper, using the [roboflow](https://roboflow.com/) platform.

#### Who are the annotators?

[More information needed]

### Personal and Sensitive Information

This data does not contain information relating to living individuals.

## Considerations for Using the Data

### Social Impact of Dataset

A growing number of datasets are related to page layout for historical documents. This dataset offers a different approach to annotating these datasets (focusing on object detection rather than pixel-level annotations). Improving document layout recognition can have a positive impact on downstream tasks, in particular Optical Character Recognition.

### Discussion of Biases

Historical documents contain a wide variety of page layouts. This means that the ability of models trained on this dataset to transfer to documents with very different layouts is not guaranteed.

### Other Known Limitations

[More information needed]

## Additional Information

### Dataset Curators

### Licensing Information

[Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode)

### Citation Information

```
@dataset{clerice_thibault_2022_6814770,
  author    = {Clérice, Thibault},
  title     = {{YALTAi: Segmonto Manuscript and Early Printed Book Dataset}},
  month     = jul,
  year      = 2022,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.6814770},
  url       = {https://doi.org/10.5281/zenodo.6814770}
}
```

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6814770.svg)](https://doi.org/10.5281/zenodo.6814770)

### Contributions

Thanks to [@davanstrien](https://github.com/davanstrien) for adding this dataset.

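Provider:

biglam

Summary of the original information

Dataset overview

Dataset name: YALTAi Segmonto Manuscript and Early Printed Book Dataset

Dataset description: This dataset contains a subset of the data used in the paper "You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine". The paper proposes treating page layout recognition on historical documents as an object detection task. The dataset contains images from digitised manuscripts and early printed books, annotated with a range of zone labels.

Supported tasks: object detection

Dataset structure

Data configurations:

YOLO configuration: uses the data's original format.
COCO configuration: converts the YOLO format into a format close to the COCO annotation format, for compatibility with the feature_extractor of Transformers models.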
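
The two example instances in the dataset card hint at how the configurations relate. The YOLO-config boxes look like `[x_center, y_center, width, height]` in pixels, and the COCO-config boxes like `[x_min, y_min, width, height]`. The following is a minimal sketch of that conversion under this assumption, which the card does not state explicitly. The function names are hypothetical, and the other example boxes in the card agree with this conversion only to within about one pixel, presumably because of rounding.

```python
from typing import List, Sequence


def center_to_corner(bbox: Sequence[float]) -> List[float]:
    """Convert [x_center, y_center, w, h] to COCO-style [x_min, y_min, w, h].

    Assumption: the YOLO config stores pixel-valued, centre-based boxes.
    This is inferred from the example instances, not stated by the card.
    """
    x_center, y_center, width, height = bbox
    return [x_center - width / 2, y_center - height / 2, float(width), float(height)]


def box_area(bbox: Sequence[float]) -> float:
    """Area of a box whose last two entries are width and height."""
    return bbox[2] * bbox[3]


# First box of the YOLO example instance shown in the card.
yolo_box = [2144, 292, 1198, 170]
coco_box = center_to_corner(yolo_box)
print(coco_box, box_area(coco_box))
```

For this first box the result matches the `bbox` and `area` values of the first object in the COCO example instance.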

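Data instances:

COCO configuration: contains the image, its height, width and image ID, and object annotations (bounding boxes, category IDs and so on).
YOLO configuration: contains the image and object annotations (bounding boxes and labels).

Data fields:

YOLO configuration: image and objects (bounding boxes and labels).
COCO configuration: image, image ID, height, width and object annotations (bounding boxes, category IDs and so on).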
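
Labels and category IDs are stored as integers. The sketch below turns them back into zone names, assuming the IDs index into the label list in the order the dataset card gives it. The card does not say so; the example instance merely fits (ID 9 sits at the top of the page like a running title, ID 4 covers the two large text blocks). The names stored in the loaded dataset's own features should be treated as authoritative.

```python
# Label names in the order the dataset card lists them.
# Assumption: integer ids index into this list (not confirmed by the card).
LABELS = [
    "DamageZone", "DigitizationArtefactZone", "DropCapitalZone", "GraphicZone",
    "MainZone", "MarginTextZone", "MusicZone", "NumberingZone", "QuireMarksZone",
    "RunningTitleZone", "SealZone", "StampZone", "TableZone", "TitlePageZone",
]


def label_names(label_ids):
    """Map integer label ids to zone names."""
    return [LABELS[i] for i in label_ids]


# `label` list from the YOLO example instance in the card.
example_labels = [9, 2, 2, 2, 4, 4]
print(label_names(example_labels))
```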

Data splits:

Training set: 854 images
Validation set: 154 images
Test set: 139 images
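
As a quick check on these numbers, the short sketch below adds up the three splits and expresses each as a share of the whole. The dictionary keys are illustrative names, not necessarily the split names used on the Hub.

```python
# Image counts per split, as given in the dataset card.
split_sizes = {"train": 854, "validation": 154, "test": 139}

total = sum(split_sizes.values())
# Percentage of all images in each split, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(total)
print(shares)
```

The total comes to 1,147 images, with roughly three quarters of them in the training split.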

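Dataset creation

Source data: The dataset is derived from several resources, including CREMMA Medieval and CREMMA Medieval Lat, and also contains newly annotated data.

Annotation process: Additional annotations were produced by correcting the output of an early version of the model on the roboflow platform.

Considerations for using the data

Social impact: Improving document layout recognition can benefit downstream tasks such as optical character recognition.

Discussion of biases: Models may struggle to transfer to documents with very different layouts.

License: The dataset is released under the Creative Commons Attribution 4.0 International license.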


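Collected summary

Dataset introduction

Construction

In the digitisation of historical documents, page layout recognition is a key step for improving downstream tasks such as optical character recognition (OCR). This dataset comes from the YALTAi study, which treats layout recognition as an object detection task rather than conventional pixel-level segmentation. It combines several existing manuscript and early print datasets, including CREMMA Medieval, CREMMA Medieval Lat, Eutyches and the Gallicorpora series, and adds a few hundred newly annotated images, in particular a test set based entirely on early prints and manuscripts. The new annotations were produced by correcting the output of an early version of the model on the Roboflow platform; the dataset card lists the annotations as expert-generated.

Features

The dataset's main feature is its two configurations, which provide the annotations in YOLO format and in a format close to COCO, making it easier to integrate with existing processing pipelines. The label set covers 14 zone types, such as main text, drop capitals, marginal text and music, reflecting the complex layouts of historical documents. The dataset is small (1,147 images in total) but comes with train (854), validation (154) and test (139) splits. The classes are strongly long-tailed in both count and area, which makes the dataset a demanding benchmark for the robustness of object detection models.

Usage

The dataset is intended for object detection on historical document images. It can be loaded directly with the Hugging Face Datasets library, choosing either the 'YOLO' or the 'COCO' configuration to suit the model framework. For Transformer-based object detection models the COCO configuration is recommended, because its format is close to what the feature_extractor in the Transformers library expects. Once loaded, the images and bounding-box annotations can be used to train models that recognise layout elements in manuscripts and early printed books, supporting the automated analysis of historical documents.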
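
A minimal loading sketch follows, using `load_dataset` from the Hugging Face `datasets` library with the repository ID and configuration names given on this page. The helper `load_config` is a hypothetical wrapper, not part of any library, and whether the call works unchanged may depend on the installed `datasets` version.

```python
REPO_ID = "biglam/yalta_ai_segmonto_manuscript_dataset"
CONFIGS = ("YOLO", "COCO")


def load_config(config_name: str = "COCO"):
    """Load one configuration of the dataset from the Hugging Face Hub.

    Requires the third-party `datasets` package and network access, so the
    import is kept inside the function.
    """
    if config_name not in CONFIGS:
        raise ValueError(f"unknown configuration: {config_name!r}")
    from datasets import load_dataset  # imported lazily; third-party

    # The second positional argument selects the configuration by name.
    return load_dataset(REPO_ID, config_name)
```

Calling `load_config("YOLO")` or `load_config("COCO")` should return the dataset with its splits; printing the result shows which splits and features are actually available.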

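Background and challenges

Background overview

In the digitisation of historical documents, page layout recognition is a key preliminary step for downstream tasks such as optical character recognition (OCR). In 2022, the YALTAi (You Actually Look Twice At it) method, proposed by researchers including Thibault Clérice, recast page layout recognition on historical documents as an object detection task rather than conventional pixel-level segmentation. The dataset underpins the YALTAi method: it combines several existing manuscript datasets such as CREMMA Medieval with a few hundred newly annotated images of early printed books and manuscripts, for a total of 1,147 high-resolution images covering 14 page-element classes such as main text, drop capitals and music. It is released in both YOLO and COCO formats for integration into different pipelines, and provides a new approach and benchmark resource for historical document layout analysis.

Current challenges

The core problem the dataset addresses is that conventional pixel-level segmentation is computationally expensive on historical documents and adapts poorly to complex layouts; YALTAi uses an object detection framework for more efficient layout recognition. Building the dataset nevertheless involved several difficulties. First, historical page layouts are highly diverse and differ markedly across periods and regions, which limits how well models generalise. Second, annotation requires experts, especially for rare classes such as damage zones and digitisation artefacts, so it is costly and consistency is hard to guarantee. Third, some classes are severely under-represented (title pages have only 5 examples), which makes training harder. Finally, the test set consists entirely of newly annotated data, whose quality depends on repeated rounds of model prediction and manual correction, a complex and time-consuming process.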
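
The class imbalance can be put into numbers with the "Total" column of the table in the dataset card. This is only an illustrative calculation. SealZone and TableZone appear in the label list but not in that table, so they are left out here.

```python
# Total annotated regions per zone, from the "Total" column of the card's table.
zone_totals = {
    "DropCapitalZone": 1939, "MainZone": 1919, "NumberingZone": 554,
    "MarginTextZone": 504, "GraphicZone": 393, "MusicZone": 308,
    "RunningTitleZone": 180, "QuireMarksZone": 92, "StampZone": 91,
    "DigitizationArtefactZone": 33, "DamageZone": 21, "TitlePageZone": 5,
}

all_regions = sum(zone_totals.values())
most_common = max(zone_totals, key=zone_totals.get)
rarest = min(zone_totals, key=zone_totals.get)

# How many times more often the most common zone occurs than the rarest one.
imbalance_ratio = zone_totals[most_common] / zone_totals[rarest]

# Share of all regions taken by the two largest classes.
top_two_share = (zone_totals["DropCapitalZone"] + zone_totals["MainZone"]) / all_regions

print(all_regions, most_common, rarest, imbalance_ratio, round(top_two_share, 3))
```

The two largest classes account for well over half of all annotated regions, while the rarest class has only five, which is the long tail described above.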

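Common scenarios

Classic use cases

The dataset is designed for page layout recognition on historical documents; its classic use is to bring the object detection paradigm to layout analysis of manuscripts and early printed books. Instead of pixel-level semantic segmentation, it annotates fourteen classes of layout element with bounding boxes, including main text, drop capitals, marginal text and music, and so provides a standardised benchmark for training document-structure models based on object detectors such as YOLO or Faster R-CNN. Researchers can use it to locate functional regions in documents efficiently, supporting automated understanding of historical texts in the digital humanities.

Practical applications

In practice, the dataset can drive key stages of a digitisation pipeline for historical documents. Before OCR, for example, a trained model can locate main text regions and decorative elements automatically, isolating noisy regions such as damage or digitisation artefacts and improving recognition accuracy. The dataset can also help libraries and archives build structured indexes of large collections, for example by automatically extracting marginal text, running titles or stamp regions, laying the groundwork for later text mining and knowledge-graph construction and speeding up the digital preservation of cultural heritage.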
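
The pre-OCR step described here can be sketched as a simple filter over COCO-style annotations (or model predictions in the same shape): keep only the main text zones and turn them into corner-form boxes, the `(left, upper, right, lower)` layout that Pillow's `Image.crop` accepts. The function name and the ID constant are hypothetical, and the ID assumes that category IDs follow the order of the card's label list.

```python
from typing import Dict, List, Tuple

MAIN_ZONE_ID = 4  # assumption: ids follow the order of the card's label list


def crop_boxes_for_label(objects: List[Dict], category_id: int) -> List[Tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) pixel boxes for one category.

    Input boxes are COCO-style [x_min, y_min, width, height].
    """
    boxes = []
    for obj in objects:
        if obj["category_id"] != category_id:
            continue
        x_min, y_min, width, height = obj["bbox"]
        boxes.append((int(x_min), int(y_min), int(x_min + width), int(y_min + height)))
    # Sort by left edge so that columns come out in a stable left-to-right order.
    return sorted(boxes)


# Three of the annotations from the COCO example instance in the card.
objects = [
    {"bbox": [1545.0, 207.0, 1198.0, 170.0], "category_id": 9},
    {"bbox": [2303.0, 539.0, 1338.0, 3808.0], "category_id": 4},
    {"bbox": [949.0, 471.0, 1286.0, 3938.0], "category_id": 4},
]
print(crop_boxes_for_label(objects, MAIN_ZONE_ID))
```

Each returned tuple could then be passed to an image cropping call before the crop is handed to an OCR engine. The left-to-right sort is a simplification and would not give a correct reading order for every layout.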

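Derived and related work

The central work associated with this dataset is its founding paper, "You Actually Look Twice At it (YALTAi)", which first argued systematically for replacing region segmentation with object detection within the Kraken engine, and released accompanying models and annotation tooling. Later research has extended the dataset, for example by combining it with resources such as CREMMA Medieval to build multilingual manuscript layout benchmarks. Some work has also explored combining detection results with Transformer architectures to improve generalisation across page layouts, moving historical document image analysis from pixel-level towards object-level understanding.

The above content was collected and summarised by 遇见数据集.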
