dh-unibe/image-text_kurrent-xix
Source: Hugging Face. Updated: 2026-04-25. Indexed: 2026-04-05.
Download link:
https://hf-mirror.com/datasets/dh-unibe/image-text_kurrent-xix
Description:
The image-text_kurrent-xix dataset was created from Transkribus PageXML data with the pagexml-hf converter. It contains 158,525 samples, all in a single split (train). It covers multiple projects, including MM_1_001 through MM_1_012 and the TEST_CITlab and TRAIN_CITlab series, which likely correspond to different handwritten text sources or subsets. Each sample has four features: image (not decoded), xml_content (string), filename (string), and project_name (string). The data is stored as parquet files organized by split and project, with a reported total size of approximately 14,843,467.55 MB. The dataset targets image-to-text tasks, in particular handwritten text recognition (HTR) and transcription of Kurrent script (a historical German handwriting style) and other 19th-century handwriting. Tags: image-to-text, htr, trocr, transcription, pagexml. License: MIT.
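Because xml_content holds PageXML as a plain string, the transcription has to be parsed out before it can be paired with the image. The following is a minimal sketch of that step, not the pagexml-hf converter's own logic. It assumes the usual PAGE layout (TextLine elements carrying a TextEquiv/Unicode child) and matches tags by local name so it does not depend on a specific schema version. The helper names and the sample document are hypothetical and only illustrate the shape of the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature PageXML document, used only for illustration.
SAMPLE_PAGE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Erste</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv><Unicode>Erste Zeile
Zweite Zeile</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_text_lines(xml_content: str) -> list[str]:
    """Return the line-level transcriptions of a PageXML string, in document order."""
    root = ET.fromstring(xml_content)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at direct children, so word-level and region-level
        # TextEquiv elements are not mistaken for the line transcription.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for grandchild in child:
                if local_name(grandchild.tag) == "Unicode" and grandchild.text:
                    lines.append(grandchild.text)
            break  # use only the first TextEquiv of each line
    return lines


if __name__ == "__main__":
    for line in extract_text_lines(SAMPLE_PAGE):
        print(line)
```

With the Hugging Face datasets library, samples could be read with load_dataset("dh-unibe/image-text_kurrent-xix", split="train", streaming=True), which avoids downloading the full set up front, and each sample's xml_content value could then be passed to extract_text_lines. Whether every page in the dataset follows the layout assumed here has not been verified.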
Provider:
dh-unibe



