dh-unibe/image-text_hgb-kf_mixture
Hugging Face | Updated: 2026-03-16 | Indexed: 2026-04-05
Download link:
https://hf-mirror.com/datasets/dh-unibe/image-text_hgb-kf_mixture
Resource description:
---
dataset_info:
  config_name: default
  features:
  - name: image
    dtype:
      image:
        decode: false
  - name: xml_content
    dtype: string
  - name: filename
    dtype: string
  - name: project_name
    dtype: string
  splits:
  - name: train
    num_examples: 154
    num_bytes: 4566033472
  download_size: 4566033472
  dataset_size: 4566033472
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/**/*.parquet
tags:
- image-to-text
- htr
- trocr
- transcription
- pagexml
license: mit
---
# Dataset Card for transkribus-exports-5657-raw-xml
This dataset was created with the pagexml-hf converter from Transkribus PageXML data.
## Dataset Summary
This dataset contains 154 samples in a single split (`train`).
It mixes material from the Historisches Grundbuch Basel and Königsfelden.
It is partly based on the Königsfelden Data Set (https://zenodo.org/records/5179361).
Each sample carries a PageXML representation of the text on the charters. For the transcription guidelines, see https://koenigsfelden.sources-online.org/intro.html.
- Geographical scope: Switzerland
- Period: 1300-1350
- Languages: Middle High German
- Type of document: Documents and files
- Provenance: State Archives of Aargau
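To make the PageXML content easier to work with, the following minimal sketch pulls the line-level transcriptions out of one XML string. It assumes that `xml_content` holds a Transkribus-style PageXML document in which each `TextLine` has a direct `TextEquiv/Unicode` child; the helper name `page_text_lines` and the tiny example document are illustrative and not part of the dataset. It needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def page_text_lines(xml_content: str) -> list[str]:
    """Return the transcription of each TextLine in a PageXML string."""
    root = ET.fromstring(xml_content)
    lines = []
    # "{*}" matches any namespace, so the exact PAGE schema version does not matter.
    for line in root.findall(".//{*}TextLine"):
        # Direct child only, so word-level TextEquiv elements are skipped.
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        if unicode_el is not None and unicode_el.text:
            lines.append(unicode_el.text)
    return lines


# Hand-written PageXML fragment for illustration (not taken from the dataset).
example_xml = """<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(page_text_lines(example_xml))
```

On a real sample, the same function would be called as `page_text_lines(sample["xml_content"])`.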
### Projects Included
- HGB_FT_M4_50
- HGB_FT_M4_75
- HGB_FT_M4_95
- u-17_0059
- u-17_0060
- u-17_0061_01
- u-17_0061_02
- u-17_0065
- u-17_0075
- u-17_0083
- u-17_0103
- u-17_0104
- u-17_0126
- u-17_0149
- u-17_0151
- u-17_0152
- u-17_0159
- u-17_0179
- u-17_0185a
- u-17_0187a
## Dataset Structure
### Data Splits
- **train**: 154 samples
### Dataset Size
- Approximate total size: 4354.51 MiB (4,566,033,472 bytes)
- Total samples: 154
### Features
- **image**: `Image(mode=None, decode=False)`
- **xml_content**: `Value('string')`
- **filename**: `Value('string')`
- **project_name**: `Value('string')`
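Because the `image` feature is declared with `decode=False`, the `datasets` library returns each image as a dictionary with `bytes` and `path` keys rather than as a decoded PIL image. The sketch below writes the raw bytes of one sample to disk. The helper name `save_image` is hypothetical, and the code assumes that `filename` is usable as an output file name.

```python
from pathlib import Path


def save_image(sample: dict, out_dir: str) -> Path:
    """Write the undecoded image of one sample to out_dir and return its path."""
    image = sample["image"]  # with decode=False: {"bytes": ..., "path": ...}
    data = image.get("bytes")
    if data is None:
        # Fallback, assuming "path" then points to a readable local file.
        data = Path(image["path"]).read_bytes()
    # Keep only the final path component of the stored filename.
    target = Path(out_dir) / Path(sample["filename"]).name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

If a decoded image is needed instead, the bytes can be opened with an imaging library such as Pillow, for example via `PIL.Image.open(io.BytesIO(data))`.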
## Data Organization
Data is organized as parquet shards by split and project:
```
data/
├── <split>/
│   └── <project_name>/
│       └── <timestamp>-<shard>.parquet
```
When the dataset is loaded, all parquet files matching the `data_files` pattern are combined into the corresponding split.
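Since the shards are grouped by project directory, it should be possible to load a single project by passing a narrower glob as `data_files`. The following is a sketch under assumptions: the repository id is taken from the download link on this page, and the helper names `project_pattern` and `load_project` are hypothetical.

```python
REPO_ID = "dh-unibe/image-text_hgb-kf_mixture"  # assumed repository id


def project_pattern(project_name: str, split: str = "train") -> str:
    """Build the glob for all parquet shards of one project in one split."""
    return f"data/{split}/{project_name}/*.parquet"


def load_project(project_name: str, split: str = "train"):
    """Load only the shards of one project (needs network access)."""
    # Imported lazily so the pattern helper works without `datasets` installed.
    from datasets import load_dataset

    # When data_files is a single pattern, the matched files are exposed
    # under the split name "train".
    return load_dataset(
        REPO_ID,
        data_files=project_pattern(project_name, split),
        split="train",
    )


print(project_pattern("HGB_FT_M4_50"))
```

Calling `load_project("HGB_FT_M4_50")` would then download only that project's shards instead of the full dataset.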
## Usage
```python
from datasets import load_dataset

# Load the entire dataset (returns a DatasetDict keyed by split)
dataset = load_dataset("dh-unibe/image-text_hgb-kf_mixture")

# Load only the train split (returns a Dataset)
train_dataset = load_dataset("dh-unibe/image-text_hgb-kf_mixture", split="train")
```
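Given the dataset size of several gigabytes, streaming can avoid downloading everything before the first sample is available. The sketch below shows one way to stream the train split and count samples per project. The function names are hypothetical and the repository id is assumed from the download link on this page; `count_by_project` works on any iterable of sample dictionaries.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_project(samples: Iterable[Mapping]) -> Counter:
    """Count how many samples belong to each project_name."""
    return Counter(sample["project_name"] for sample in samples)


def stream_train(repo_id: str = "dh-unibe/image-text_hgb-kf_mixture"):
    """Return a lazily evaluated iterable over the train split (needs network)."""
    from datasets import load_dataset

    # streaming=True yields samples on the fly instead of downloading all shards first.
    return load_dataset(repo_id, split="train", streaming=True)


# Offline illustration with stand-in records instead of real samples.
demo = [
    {"project_name": "HGB_FT_M4_50"},
    {"project_name": "HGB_FT_M4_50"},
    {"project_name": "u-17_0059"},
]
print(count_by_project(demo))
```

With network access, `count_by_project(stream_train())` would iterate over the real samples; note that this still reads every record once.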
Provider:
dh-unibe



