impresso-project/frakturline-dataset

Name: impresso-project/frakturline-dataset
Creator: impresso-project
Published: 2026-03-24 09:08:39
License: No description provided

Hugging Face | Updated 2026-03-24 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/impresso-project/frakturline-dataset

Description:

---
language:
- de
- fr
- lb
license: agpl-3.0
tags:
- image-classification
- pytorch
- fraktur
- other
- historical-documents
- ocr
- impresso
datasets:
- impresso-project/frakturline-dataset
- impresso-project/frakturline-testset
pipeline_tag: image-classification
---

# Fraktur/Other Text-Line Classifier

A binary CNN classifier that determines whether a scanned text-line image is set in **Fraktur** (blackletter / Gothic script) or **Other** (primarily Latin / Roman / Antiqua script).

Developed for the [Impresso](https://impresso-project.ch/) digital humanities project, which processes millions of historical newspaper pages in German, French, Luxembourgish, and other European languages.

---

## Model Details

| Property      | Value                                                                      |
| ------------- | -------------------------------------------------------------------------- |
| Architecture  | `BinaryClassificationCNN`: 3-layer CNN with LayerNorm and Dropout          |
| Input         | Grayscale text-line image, resized/padded to **60 × 800 px**               |
| Output        | Single logit; `logit > 0` → Fraktur (equivalent to `sigmoid(logit) > 0.5`) |
| Parameters    | ~2.1 M                                                                     |
| Training data | ~32 000 manually labeled line crops from Swiss/Luxembourgish newspapers    |
| Framework     | PyTorch                                                                    |

### Architecture

```
Input (1, 60, 800)
  → Conv2d(1→32) + ReLU + MaxPool2d → LayerNorm[32, 30, 400]
  → Conv2d(32→64) + ReLU + MaxPool2d → LayerNorm[64, 15, 200] + Dropout(0.15)
  → Conv2d(64→128) + LayerNorm[128,15,200] + ReLU + AdaptiveMaxPool2d(1×8)
  → Flatten(1 024) → FC(128) + ReLU → FC(1)
```

### Training

- **Loss**: `BCEWithLogitsLoss`
- **Optimizer**: Adam, lr = 1e-4 with `ReduceLROnPlateau` (factor 0.5, patience 2)
- **Epochs**: up to 20 with early stopping (patience 5)
- **Augmentation**: random rotation ±2°, Gaussian noise (σ=0.05), random right-masking (p=0.15, up to 50 % of width) to improve robustness on short lines
- **Class balancing**: `WeightedRandomSampler` (other ≈ 20 k, fraktur ≈ 14 k)

---

## Performance

Evaluated on the companion held-out test set ([impresso-project/frakturline-testset](https://huggingface.co/datasets/impresso-project/frakturline-testset)), which contains 2 000 balanced images (1 000 per class) strictly excluded from training:

| Metric              | Score       |
| ------------------- | ----------- |
| Accuracy            | **99.75 %** |
| Precision (Fraktur) | **100.0 %** |
| Recall (Fraktur)    | **99.5 %**  |
| F1 (Fraktur)        | **99.75 %** |
| FP / FN             | 0 FP / 5 FN |

---

## Evaluation Dataset

The test set is published as a separate frozen HF dataset:

→ [impresso-project/frakturline-testset](https://huggingface.co/datasets/impresso-project/frakturline-testset)

It is **not** included in the training corpus. Please consult the dataset card for detailed provenance, revision, and licensing information. Do not use it for training.

```python
from datasets import load_dataset

ds = load_dataset("impresso-project/frakturline-testset", split="test")
# 2 000 images: {"image": <PIL.Image>, "label": "fraktur"|"other", ...}
```

---

## Usage

### Install dependencies

```bash
pip install torch torchvision Pillow huggingface_hub
```

### Classify images

```python
from huggingface_hub import hf_hub_download
import importlib.util

# Load pipeline.py from the hub
spec = importlib.util.spec_from_file_location(
    "pipeline",
    hf_hub_download("impresso-project/frakturline-classification-cnn", "pipeline.py"),
)
pipeline_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(pipeline_module)

pipe = pipeline_module.FrakturPipeline.from_pretrained(
    "impresso-project/frakturline-classification-cnn"
)

# Single image: local path
result = pipe("path/to/line.png")
# → {"label": "fraktur", "score": 0.9731}

# Single image: URL
result = pipe("https://example.com/line.png")

# Batch
results = pipe(["line1.png", "line2.png", "line3.png"])
```

### Input format

- Any PIL-readable image format (PNG, JPEG, TIFF, …)
- Ideally a single text line crop extracted by an OCR layout-analysis tool
- The pipeline handles grayscale conversion and resizing internally

### Output format

```python
{"label": "fraktur", "score": 0.9731}  # sigmoid probability of predicted class
{"label": "other", "score": 0.9954}
```

---

## Limitations

- Designed for **single text lines**. Mixed-typeface lines or non-text content may produce unreliable results.
- Short headers, ornaments, or lines with very few characters can be ambiguous.
- The training data is drawn primarily from 19th- and 20th-century European newspapers; performance on other periods or regions is not guaranteed.

---

## Citation

If you use this model, please cite the Impresso project:

```bibtex
@misc{impresso2025fraktur,
  title  = {Fraktur/Antiqua Text-Line Classifier},
  author = {Impresso Project},
  year   = {2025},
  url    = {https://huggingface.co/impresso-project/frakturline-classification-cnn}
}
```

---

## License

The code in this repository is released under the **GNU Affero General Public License v3.0 (AGPL-3.0)**.

The model was trained on data derived from multiple upstream sources. Rights in the underlying source materials remain subject to their respective original terms. For dataset-specific provenance, revision, and licensing details, please consult the linked dataset cards.

If you use this model, please cite the [Impresso project](https://impresso-project.ch/) and link to this repository.
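
As a cross-check of the figures reported under Performance, the sketch below recomputes accuracy, precision, recall, and F1 from the stated confusion counts (0 FP, 5 FN on 1 000 images per class). The helper name `binary_metrics` and the derived counts `tp=995` and `tn=1000` are illustrative assumptions inferred from the card, not part of the published code.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators so the helper is safe on edge cases.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Assumed counts: 1 000 Fraktur lines with 5 missed (FN), and
# 1 000 Other lines with none misclassified as Fraktur (FP).
metrics = binary_metrics(tp=995, fp=0, fn=5, tn=1000)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these assumed counts the recomputed values agree with the table once rounded to two decimals, which suggests the reported scores refer to Fraktur as the positive class.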

Provider:

impresso-project
