SBB/page_extraction_dataset
Source: Hugging Face. Updated 2026-01-23; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/SBB/page_extraction_dataset
Description:
In digitised cultural heritage items such as books, newspapers and archival records, black margins around a page caused by document scanning can negatively affect OCR. To enable document layout analysis (DLA), these black margins need to be cropped and the pages extracted correctly. This dataset was created to train a machine learning model capable of extracting pages. The machine learning task falls into the domain of image segmentation and, more generally, computer vision. The dataset was compiled by Vahid Rezanezhad within the research project Mensch.Maschine.Kultur – Künstliche Intelligenz für das Digitale Kulturelle Erbe at the Staatsbibliothek zu Berlin – Berlin State Library (SBB). The research project was funded by the Federal Government Commissioner for Culture and the Media (BKM). The dataset includes image files from various sources, including previously unpublished SBB images, together with label files.
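To illustrate the cropping step described above, here is a minimal sketch of how a binary page mask (the kind of output a segmentation model could produce) might be used to remove black margins. It is not taken from the dataset or the project's code: the function names `page_bounding_box` and `crop_to_page` are hypothetical, images are plain nested lists to keep the example dependency-free, and the mask is assumed to mark page pixels as truthy.

```python
def page_bounding_box(mask):
    """Return (top, bottom, left, right) of the truthy region in a 2D mask.

    `bottom` and `right` are exclusive. Returns None if the mask is empty
    or contains no truthy pixel.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    width = len(mask[0])
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_to_page(image, mask):
    """Crop `image` to the bounding box of the page region in `mask`."""
    box = page_bounding_box(mask)
    if box is None:
        return image  # no page found; leave the scan untouched
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]


# Synthetic 6x8 "scan": black margin (0) around a bright page (255).
scan = [[0] * 8 for _ in range(6)]
for r in range(1, 5):
    for c in range(2, 7):
        scan[r][c] = 255

# Stand-in for a model's predicted page mask (assumption: page pixels are True).
page_mask = [[pixel > 0 for pixel in row] for row in scan]

page = crop_to_page(scan, page_mask)
print(len(page), len(page[0]))  # 4 5
```

A bounding-box crop is the simplest possible post-processing choice; skewed or warped pages would need a more careful treatment of the predicted region.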
Provider:
SBB



