dh-unibe/image-text_kurrent-xix

Name: dh-unibe/image-text_kurrent-xix
Creator: dh-unibe
Published: 2026-04-25 21:26:42
License: MIT

Source: Hugging Face. Updated 2026-04-25; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/dh-unibe/image-text_kurrent-xix

Description:

The image-text_kurrent-xix dataset was created from Transkribus PageXML data with the pagexml-hf converter. It contains 158,525 samples in a single split (train). The samples are grouped into projects, including MM_1_001 through MM_1_012 and the TEST_CITlab and TRAIN_CITlab series, which likely correspond to different handwritten sources or subsets. Each sample has four features: image (not decoded), xml_content (string), filename (string), and project_name (string). The data is stored as parquet files organized by split and project, with a reported total size of approximately 14,843,467.55 MB. The dataset is intended for image-to-text tasks, in particular handwritten text recognition (HTR) and transcription of Kurrent script (a historical German handwriting style) and other 19th-century handwriting. Tags: image-to-text, htr, trocr, transcription, pagexml. The license is MIT.
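
Because each sample carries its transcription as a PageXML string in xml_content, a common first step is to pull the line-level text out of that string. The following is a minimal sketch, not the converter's own method. It assumes the XML follows the usual PAGE layout (TextLine elements with a direct TextEquiv/Unicode child); SAMPLE_PAGEXML is a made-up fragment for illustration, not a record from the dataset, and extract_lines and stream_samples are hypothetical helper names. The streaming helper assumes the third-party `datasets` package is installed and that the repository loads with the default configuration.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple

# Made-up PageXML fragment for illustration only (not taken from the dataset).
SAMPLE_PAGEXML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="r1l1">
        <Coords points="0,0 100,0 100,25 0,25"/>
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="0,25 100,25 100,50 0,50"/>
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def extract_lines(xml_content: str) -> List[Tuple[str, str]]:
    """Return (line id, transcription) pairs in document order.

    The "{*}" wildcard (Python 3.8+) matches any namespace, so the
    function does not depend on a specific PAGE schema version.
    """
    root = ET.fromstring(xml_content)
    lines = []
    for line in root.findall(".//{*}TextLine"):
        # Only the line's own TextEquiv is read, not word-level ones.
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        text = (unicode_el.text or "") if unicode_el is not None else ""
        lines.append((line.get("id", ""), text))
    return lines


def stream_samples(limit: int = 3) -> Iterator[Tuple[str, str, List[Tuple[str, str]]]]:
    """Stream a few records without downloading the full dataset.

    Assumption: the `datasets` package is installed and network access
    is available. Field names are those listed in the description.
    """
    from datasets import load_dataset  # imported lazily; third-party

    ds = load_dataset("dh-unibe/image-text_kurrent-xix", split="train", streaming=True)
    for record in ds.take(limit):
        yield record["filename"], record["project_name"], extract_lines(record["xml_content"])


if __name__ == "__main__":
    for line_id, text in extract_lines(SAMPLE_PAGEXML):
        print(line_id, text)
```

Streaming is used in the sketch because of the reported dataset size; filtering on project_name inside the loop would be one way to restrict work to a single project such as the TRAIN_CITlab series.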

Provider:

dh-unibe
