impresso-project/frakturline-testset

Hugging Face. Updated 2026-03-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/impresso-project/frakturline-testset
Resource overview:
---
language:
- de
- fr
- lb
license: cc-by-nc-4.0
task_categories:
- image-classification
tags:
- fraktur
- blackletter
- antiqua
- ocr
- historical-newspapers
- historical-documents
- impresso
- text-line
- evaluation
size_categories:
- 1K<n<10K
dataset_revision: v1.0.0
---

# Fraktur/Other Text-Line — Test Set

A **balanced, held-out evaluation set** of 2 000 scanned text-line images (1 000 per class) for the binary task of distinguishing **Fraktur** (blackletter / Gothic script) from **other** script (primarily Antiqua / Latin / Roman). Developed for the [Impresso](https://impresso-project.ch/) digital humanities project.

---

## Dataset Details

| Property         | Value                                                       |
| ---------------- | ----------------------------------------------------------- |
| Task             | Binary image classification                                 |
| Classes          | `fraktur`, `other`                                          |
| Images per class | 1 000                                                       |
| Total images     | 2 000                                                       |
| Image format     | WebP (lossless), grayscale text-line crops                  |
| Typical size     | variable width × ~60 px height                              |
| Languages        | German, French, Luxembourgish                               |
| Source           | Historical Swiss and Luxembourgish newspapers               |
| Sampling         | Stratified random sample (seed=42) from the training corpus |

---

## Data Fields

| Field        | Type         | Description |
| ------------ | ------------ | ----------- |
| `image`      | `PIL.Image`  | Grayscale text-line crop |
| `label`      | `ClassLabel` | `"fraktur"` or `"other"` |
| `file_name`  | `string`     | Repo-relative path (`data/test/{label}/{seq_id}.webp`); the sequential ID carries no provenance |
| `source_tag` | `string`     | Original filename stem; partially encodes newspaper/date of origin where known. **Not systematic** and should not be relied upon for analysis |

---

## Usage

```python
from datasets import load_dataset

ds = load_dataset("impresso-project/frakturline-testset", split="test")
print(ds[0])
# {"image": <PIL.Image>, "label": "fraktur",
#  "file_name": "data/test/fraktur/0001.webp", "source_tag": "..."}
```

### Evaluate a model

```python
from datasets import load_dataset
from PIL import Image
import torch
from torchvision import transforms

ds = load_dataset("impresso-project/frakturline-testset", split="test")

# Load your model here ...
# `predict` is a placeholder for your own inference function; it is not
# defined by this dataset. It must return labels in the same representation
# as item["label"]. If the column loads as integer class indices,
# ds.features["label"].int2str(i) gives the class name.

# Count the items whose prediction matches the reference label.
correct = sum(
    1 for item in ds
    if predict(item["image"]) == item["label"]
)
print(f"Accuracy: {correct / len(ds):.3f}")
```

---

## Companion Model

The classifier trained and evaluated against this test set is published at:

→ [impresso-project/frakturline-classification-cnn](https://huggingface.co/impresso-project/frakturline-classification-cnn)

---

## Important Caveats

- **`source_tag` is informational only.** It reflects original filenames, which are partially human-readable (some encode newspaper ID + date), but the naming is not systematic across the corpus. Do not use it as a structured metadata field.
- **Held-out status.** These images were moved out of the training corpus before training began (`sample_testset.py`, seed=42). They should not be used for training.
- **Class balance.** The test set is artificially balanced (1 000/class). The underlying corpus has a higher proportion of Antiqua text. Reported accuracy on this set does not directly reflect real-world class distributions.
- **Script scope.** `other` consists primarily of Antiqua but may include mixed-typeface lines, decorative elements, or non-Latin scripts. It is not a pure Antiqua set.

---

## Data provenance and upstream licensing

Part of the source material used to construct this dataset was derived from the National Library of Luxembourg (BnL) OCR ground-truth release [`bnl-ground-truth-newspapers-before-1878.zip`](https://data.bnl.lu/wp-content/uploads/2021/07/bnl-ground-truth-newspapers-before-1878.zip).
The BnL historical newspapers portal describes these OCR datasets as **Public Domain / CC0** and characterizes the Ground Truth Pack as containing **33,000 transcribed text lines**, including **19,000 Antiqua** and **14,000 Fraktur** lines. See also the dataset overview page: [BnL Historical Newspapers](https://data.bnl.lu/data/historical-newspapers/).

For background on the BnL OCR workflow and data production context, see:

> Schneider, Pit, and Yves Maurer. *Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction*. *Journal of Data Mining & Digital Humanities*, 2022. Article page: [https://jdmdh.episciences.org/10239](https://jdmdh.episciences.org/10239).

The present dataset is not identical to the original BnL release. The upstream material was cleaned by removing lines that did not contain identifiable **Fraktur** or **Antiqua** characters and by excluding **mixed-font** cases, while retaining lines containing digits.

Curation was performed iteratively, beginning from the original BnL splits and involving repeated manual inspection of prediction errors across cross-validation splits. For example, lines labeled as **Fraktur** but consistently predicted as **Antiqua**, or vice versa, were reviewed manually. In some cases these were acceptable borderline instances, but many revealed inconsistencies with the dataset guidelines, which were intended to preserve only clear **Fraktur** / **non-Fraktur** distinctions. Such inconsistent cases were removed.

The test set also includes **84 additional lines** from impresso corpus material curated by the dataset creators.

---

**Release version:** `v1.0.0`

This release may receive metadata-only updates after publication. The dataset contents associated with this release are frozen under the Git tag `v1.0.0`. Any modification to the dataset contents, including items or labels, triggers a new release version.
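
Because the contents are frozen under the tag `v1.0.0`, an evaluation can be pinned to that tag so that later releases do not silently change the numbers. The sketch below assumes the tag is resolvable on the Hub through the `revision` argument of `load_dataset`; the helper name `load_frozen_testset` and the two constants are illustrative and not part of the dataset.

```python
# Assumption: the Hub repository exposes the Git tag "v1.0.0" named in this card.
DATASET_ID = "impresso-project/frakturline-testset"
DATASET_REVISION = "v1.0.0"


def load_frozen_testset(revision: str = DATASET_REVISION):
    """Load the test split at a fixed revision (hypothetical helper)."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # `revision` accepts a branch name, tag, or commit hash of the dataset repo.
    return load_dataset(DATASET_ID, split="test", revision=revision)
```

Calling `load_frozen_testset()` downloads the split from the Hub, so it needs network access and the `datasets` package.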
---

## License

This dataset contains material from multiple sources with different rights statuses.

- Part of the source material derives from the National Library of Luxembourg (BnL) OCR ground-truth release [`bnl-ground-truth-newspapers-before-1878.zip`](https://data.bnl.lu/wp-content/uploads/2021/07/bnl-ground-truth-newspapers-before-1878.zip), which the BnL describes as **CC0 / Public Domain**.
- Additional original contributions made by the dataset creators in this release (including curation, selection, cleaning decisions, split construction, and documentation) are made available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

Users should therefore distinguish between upstream source material, which remains available under its original terms, and original contributions in this release, which are provided under **CC BY 4.0**.

This dataset is intended primarily for **evaluation use**. Do not train models with it!

Users remain responsible for assessing the rights status of any upstream source material, especially where historical newspaper provenance or copyright status is not fully documented across jurisdictions.

If you use this dataset, please cite the [Impresso project](https://impresso-project.ch/) and link to this repository.
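
As a supplement to the class-balance caveat under Important Caveats: accuracy on a balanced set is the plain average of the two per-class recalls, so it can be re-weighted to estimate accuracy under a different class mix. The following is a minimal sketch, not part of the dataset tooling. The function names, the toy labels, and the prior of 0.25 are all assumptions made for illustration; the card does not state the real Fraktur proportion of the corpus.

```python
from collections import Counter


def per_class_recall(y_true, y_pred, classes=("fraktur", "other")):
    """Return {class: fraction of that class's items predicted correctly}.

    Assumes every class in `classes` occurs at least once in `y_true`.
    """
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in classes}


def prior_adjusted_accuracy(recalls, fraktur_prior):
    """Expected accuracy if a fraction `fraktur_prior` of lines were Fraktur."""
    return (fraktur_prior * recalls["fraktur"]
            + (1.0 - fraktur_prior) * recalls["other"])


# Toy labels for illustration only; these are not results on the test set.
y_true = ["fraktur"] * 4 + ["other"] * 4
y_pred = ["fraktur", "fraktur", "fraktur", "other"] + ["other"] * 4

recalls = per_class_recall(y_true, y_pred)
balanced = prior_adjusted_accuracy(recalls, fraktur_prior=0.5)
skewed = prior_adjusted_accuracy(recalls, fraktur_prior=0.25)  # hypothetical prior
print(recalls, balanced, skewed)
```

With a prior of 0.5 the function reproduces the balanced-set accuracy; any other prior shifts the weight toward the recall of the more frequent class.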
Provider:
impresso-project