jwidmer/rawxml-test-cli

Hugging Face | Updated 2026-03-27 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/jwidmer/rawxml-test-cli
Dataset description:
---
dataset_info:
  config_name: default
  features:
  - name: image
    dtype:
      image:
        decode: false
  - name: text
    dtype: string
  - name: line_id
    dtype: string
  - name: line_reading_order
    dtype: int64
  - name: line_coords
    dtype:
      sequence:
        sequence: int64
  - name: line_baseline
    dtype:
      sequence:
        sequence: int64
  - name: line_augmentation
    dtype: string
  - name: region_id
    dtype: string
  - name: region_reading_order
    dtype: int64
  - name: region_type
    dtype: string
  - name: region_coords
    dtype:
      sequence:
        sequence: int64
  - name: filename
    dtype: string
  - name: project_name
    dtype: string
  splits:
  - name: train
    num_examples: 492
    num_bytes: 31862904
  download_size: 31862904
  dataset_size: 31862904
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/**/*.parquet
tags:
- image-to-text
- htr
- trocr
- transcription
- pagexml
license: mit
---

# Dataset Card for rawxml-test-cli

This dataset was created using pagexml-hf converter from Transkribus PageXML data.

## Dataset Summary

This dataset contains 492 samples across 1 split(s).

## Dataset Structure

### Data Splits

- **train**: 492 samples

### Dataset Size

- Approximate total size: 30.39 MB
- Total samples: 492
- Number of augmentations: 2

### Features

- **image**: `Image(mode=None, decode=False)`
- **text**: `Value('string')`
- **line_id**: `Value('string')`
- **line_reading_order**: `Value('int64')`
- **line_coords**: `List(List(Value('int64')))`
- **line_baseline**: `List(List(Value('int64')))`
- **line_augmentation**: `Value('string')`
- **region_id**: `Value('string')`
- **region_reading_order**: `Value('int64')`
- **region_type**: `Value('string')`
- **region_coords**: `List(List(Value('int64')))`
- **filename**: `Value('string')`
- **project_name**: `Value('string')`

## Data Organization

Data is organized as parquet shards by split and project:

```
data/
├── <split>/
│   └── <project_name>/
│       └── <timestamp>-<shard>.parquet
```

The HuggingFace Hub automatically merges all parquet files when loading the dataset.

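
The shard layout described under Data Organization can be taken apart with the standard library alone. The following is a minimal sketch, not part of pagexml-hf: `ShardInfo` and `parse_shard_path` are hypothetical names, the example file name is made up, and the code assumes the shard index is whatever follows the last hyphen in the file stem (the card does not document the exact timestamp or shard format).

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ShardInfo(NamedTuple):
    """Components of a shard path (hypothetical helper type)."""
    split: str
    project_name: str
    timestamp: str
    shard: str


def parse_shard_path(path: str) -> ShardInfo:
    """Split 'data/<split>/<project_name>/<timestamp>-<shard>.parquet' into its parts."""
    p = PurePosixPath(path)
    parts = p.parts
    if len(parts) != 4 or parts[0] != "data" or p.suffix != ".parquet":
        raise ValueError(f"unexpected shard path: {path!r}")
    # Assumption: the shard index follows the LAST hyphen, so a timestamp
    # that itself contains hyphens stays intact.
    timestamp, sep, shard = p.stem.rpartition("-")
    if not sep:
        raise ValueError(f"no '<timestamp>-<shard>' pattern in: {p.name!r}")
    return ShardInfo(parts[1], parts[2], timestamp, shard)


# Made-up path for illustration only.
info = parse_shard_path("data/train/example_project/20260327-00000.parquet")
print(info)
```

Splitting on the last hyphen rather than the first is a deliberate choice here: project names and timestamps may contain hyphens, while a numeric shard suffix presumably does not.
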
## Usage

```python
from datasets import load_dataset

# Load entire dataset
dataset = load_dataset("jwidmer/rawxml-test-cli")

# Load specific split
train_dataset = load_dataset("jwidmer/rawxml-test-cli", split="train")
```

### Projects Included

1505-02-10_Hanserezess,_Lübeck_Dienstag_nach_Scholastice_1505_(SAHST_Rep__2,_I_040-4)
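
Because each record is a single text line, a page-level transcription has to be reassembled from the `filename`, `region_reading_order`, and `line_reading_order` features. The sketch below shows one way this might be done. It assumes that the two reading-order integers define the intended order within a page and that `filename` identifies the page. `page_text` is a hypothetical helper and the rows are made-up stand-ins for dataset records. It does not handle `line_augmentation`, whose values the card does not document, so augmented variants of a line may need filtering first.

```python
from collections import defaultdict


def page_text(rows):
    """Group line records by filename and join their text in reading order."""
    pages = defaultdict(list)
    for row in rows:
        pages[row["filename"]].append(row)

    result = {}
    for filename, lines in pages.items():
        # Order by region first, then by line within the region (assumed semantics).
        lines.sort(key=lambda r: (r["region_reading_order"], r["line_reading_order"]))
        result[filename] = "\n".join(r["text"] for r in lines)
    return result


# Made-up rows for illustration; real records carry more fields.
rows = [
    {"filename": "page1.xml", "region_reading_order": 1, "line_reading_order": 0, "text": "third"},
    {"filename": "page1.xml", "region_reading_order": 0, "line_reading_order": 1, "text": "second"},
    {"filename": "page1.xml", "region_reading_order": 0, "line_reading_order": 0, "text": "first"},
]
pages = page_text(rows)
print(pages["page1.xml"])
```

Iterating a loaded `datasets.Dataset` yields one dict per row, so passing `train_dataset` in place of `rows` should work the same way, although each row then also carries the undecoded image payload.
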
Provided by:
jwidmer