five

TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

收藏
NIAID Data Ecosystem2026-05-02 收录
下载链接:
https://zenodo.org/record/15057330
下载链接
链接失效反馈
官方服务:
资源简介:
TextBite is a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. It is mainly aimed at logical segmentation, but can be used for other tasks as well. Additionally, part of the dataset contains handwritten documents, primarily records from schools and public organizations, introducing extra segmentation challenges due to their more loosely structured layouts. In total, the dataset contains 8,449 annotated pages, from which 7,346 pages are printed and 1,103 are handwritten. The pages contain a total of 78,863 segments. The test subset contains 964 pages, of which 185 are handwritten. The annotations are provided in an extended COCO format. Each segment is represented by a set of axis aligned bounding boxes, which are connected by directed relationships, representing reading order. To include these relationships in the COCO format, a new top-level key relations is added. Each relation entry specifies a source and a target bounding box. In addition to the layout annotations, we provide a textual representation of the pages produced by Optical Character Recognition (OCR) tool PERO-OCR. These come in the form of XML files in the PAGE-XML format, which includes an enclosing polygon for each individual textline along with the transcriptions and their confidences. Lastly, we provide the OCR results in the ALTO format, which includes polygons for individual words in the page image.
创建时间:
2025-03-31
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作