impresso-project/frakturline-testset
Source: Hugging Face · Updated 2026-03-23 · Indexed 2026-03-29
Download link: https://hf-mirror.com/datasets/impresso-project/frakturline-testset

---
language:
- de
- fr
- lb
license: cc-by-nc-4.0
task_categories:
- image-classification
tags:
- fraktur
- blackletter
- antiqua
- ocr
- historical-newspapers
- historical-documents
- impresso
- text-line
- evaluation
size_categories:
- 1K<n<10K
dataset_revision: v1.0.0
---
# Fraktur/Other Text-Line — Test Set
A **balanced, held-out evaluation set** of 2 000 scanned text-line images (1 000 per class) for the binary task of distinguishing **Fraktur** (blackletter / Gothic script) from **other** script (primarily Antiqua / Latin / Roman).
Developed for the [Impresso](https://impresso-project.ch/) digital humanities project.
---
## Dataset Details
| Property | Value |
| ---------------- | ----------------------------------------------------------- |
| Task | Binary image classification |
| Classes | `fraktur`, `other` |
| Images per class | 1 000 |
| Total images | 2 000 |
| Image format | WebP (lossless), grayscale text-line crops |
| Typical size | variable width × ~60 px height |
| Languages | German, French, Luxembourgish |
| Source | Historical Swiss and Luxembourgish newspapers |
| Sampling | Stratified random sample (seed=42) from the training corpus |
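
The sampling row above can be illustrated with a small sketch. This is not the project's `sample_testset.py`, whose contents are not reproduced in this card; the function name `stratified_sample`, the `(path, label)` input format, and the toy corpus are assumptions made for illustration. The idea is to draw the same number of items from each class with a seeded random generator, so the selection is reproducible and the held-out items can be removed from the training pool.

```python
import random
from collections import defaultdict


def stratified_sample(items, per_class, seed=42):
    """Draw `per_class` items from each label group, reproducibly.

    `items` is an iterable of (path, label) pairs (a hypothetical format).
    Returns (sampled, remaining) so the held-out items can be removed
    from the training pool.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append(path)

    sampled, remaining = [], []
    for label in sorted(by_label):  # fixed label order keeps runs reproducible
        paths = sorted(by_label[label])  # fixed input order before sampling
        if len(paths) < per_class:
            raise ValueError(f"class {label!r} has only {len(paths)} items")
        chosen = set(rng.sample(paths, per_class))
        sampled += [(p, label) for p in paths if p in chosen]
        remaining += [(p, label) for p in paths if p not in chosen]
    return sampled, remaining


# Toy corpus: 30 Fraktur lines and 50 other lines (made-up paths).
corpus = [(f"fraktur_{i:03d}.png", "fraktur") for i in range(30)]
corpus += [(f"other_{i:03d}.png", "other") for i in range(50)]

test_items, train_items = stratified_sample(corpus, per_class=10)
print(len(test_items), len(train_items))
```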
---
## Data Fields
| Field | Type | Description |
| ------------ | ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `image` | `PIL.Image` | Grayscale text-line crop |
| `label` | `ClassLabel` | `"fraktur"` or `"other"` |
| `file_name` | `string` | Repo-relative path (`data/test/{label}/{seq_id}.webp`) — sequential ID, carries no provenance |
| `source_tag` | `string` | Original filename stem — partially encodes newspaper/date of origin where known; **not systematic** and should not be relied upon for analysis |
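
Since `file_name` follows the documented pattern `data/test/{label}/{seq_id}.webp`, it can be split into its parts, for example to cross-check the folder name against the `label` field. The helper name `parse_file_name` is hypothetical and the sketch assumes every path follows that four-part layout exactly.

```python
from pathlib import PurePosixPath


def parse_file_name(file_name):
    """Split 'data/test/{label}/{seq_id}.webp' into (label, seq_id)."""
    path = PurePosixPath(file_name)
    parts = path.parts
    if path.suffix != ".webp" or len(parts) != 4 or parts[:2] != ("data", "test"):
        raise ValueError(f"unexpected file_name layout: {file_name!r}")
    return parts[2], path.stem


label, seq_id = parse_file_name("data/test/fraktur/0001.webp")
print(label, seq_id)  # fraktur 0001
```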
---
## Usage
```python
from datasets import load_dataset
ds = load_dataset("impresso-project/frakturline-testset", split="test")
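print(ds[0])
# {"image": <PIL.Image>, "label": <int class id>, "file_name": "data/test/fraktur/0001.webp", "source_tag": "..."}
# `label` is a ClassLabel, so it is returned as an integer id; map it back to its name with:
print(ds.features["label"].int2str(ds[0]["label"]))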
```
### Evaluate a model
```python
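from datasets import load_dataset

ds = load_dataset("impresso-project/frakturline-testset", split="test")
label_feature = ds.features["label"]  # ClassLabel: integer id <-> class name


def predict(image):
    """Placeholder: load your model and return "fraktur" or "other" for a PIL image."""
    raise NotImplementedError("plug in your model's inference here")


# Compare predicted class names with the ground-truth class names.
correct = sum(
    1 for item in ds
    if predict(item["image"]) == label_feature.int2str(item["label"])
)
print(f"Accuracy: {correct / len(ds):.3f}")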
```
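
A single accuracy figure does not show which class the errors fall on. The sketch below counts (true, predicted) pairs and derives per-class recall from two lists of class names. The function names `confusion_counts` and `per_class_recall` are hypothetical, and the labels are made up for illustration; they are not results from any model.

```python
from collections import Counter

CLASSES = ("fraktur", "other")


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def per_class_recall(counts):
    """Fraction of each true class that was predicted correctly."""
    recall = {}
    for cls in CLASSES:
        total = sum(counts[(cls, pred)] for pred in CLASSES)
        recall[cls] = counts[(cls, cls)] / total if total else float("nan")
    return recall


# Made-up labels for illustration only.
y_true = ["fraktur", "fraktur", "fraktur", "other", "other", "other"]
y_pred = ["fraktur", "fraktur", "other", "other", "other", "other"]

counts = confusion_counts(y_true, y_pred)
recall = per_class_recall(counts)
print(counts)
print(recall)
```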
---
## Companion Model
The classifier trained and evaluated against this test set is published at:
→ [impresso-project/frakturline-classification-cnn](https://huggingface.co/impresso-project/frakturline-classification-cnn)
---
## Important Caveats
- **`source_tag` is informational only.** It reflects the original filenames, which are partially human-readable (some encode newspaper ID + date), but the naming is not systematic across the corpus. Do not use it as a structured metadata field.
- **Held-out status.** These images were moved out of the training corpus before training began (`sample_testset.py`, seed=42). They should not be used for training.
- **Class balance.** The test set is artificially balanced (1 000/class). The underlying corpus has a higher proportion of Antiqua text. Reported accuracy on this set does not directly reflect real-world class distributions.
- **Script scope.** `other` consists primarily of Antiqua but may include mixed-typeface lines, decorative elements, or non-Latin scripts. It is not a pure Antiqua set.
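
The class-balance caveat can be made concrete with a short calculation. If per-class recall is measured on this balanced set, the accuracy expected under a different class mix is the recall of each class weighted by its share. The sketch assumes the measured recalls carry over unchanged to the target data, which may not hold in practice; the function name `expected_accuracy` and all numbers are hypothetical.

```python
def expected_accuracy(recall_fraktur, recall_other, fraktur_share):
    """Accuracy expected when a fraction `fraktur_share` of lines is Fraktur.

    Assumes per-class recall measured on the balanced test set carries
    over unchanged to the target distribution.
    """
    if not 0.0 <= fraktur_share <= 1.0:
        raise ValueError("fraktur_share must be between 0 and 1")
    return fraktur_share * recall_fraktur + (1.0 - fraktur_share) * recall_other


# Hypothetical recalls, not measured results.
balanced = expected_accuracy(0.90, 0.98, fraktur_share=0.5)
skewed = expected_accuracy(0.90, 0.98, fraktur_share=0.2)
print(f"balanced: {balanced:.3f}, skewed: {skewed:.3f}")
```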
---
## Data provenance and upstream licensing
Part of the source material used to construct this dataset was derived from the National Library of Luxembourg (BnL) OCR ground-truth release [`bnl-ground-truth-newspapers-before-1878.zip`](https://data.bnl.lu/wp-content/uploads/2021/07/bnl-ground-truth-newspapers-before-1878.zip). The BnL historical newspapers portal describes these OCR datasets as **Public Domain / CC0** and characterizes the Ground Truth Pack as containing **33,000 transcribed text lines**, including **19,000 Antiqua** and **14,000 Fraktur** lines. See also the dataset overview page: [BnL Historical Newspapers](https://data.bnl.lu/data/historical-newspapers/).
For background on the BnL OCR workflow and data production context, see:
> Schneider, Pit, and Yves Maurer. *Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction*. *Journal of Data Mining & Digital Humanities*, 2022. Article page: [https://jdmdh.episciences.org/10239](https://jdmdh.episciences.org/10239).
The present dataset is not identical to the original BnL release. The upstream material was cleaned by removing lines that did not contain identifiable **Fraktur** or **Antiqua** characters and by excluding **mixed-font** cases, while retaining lines containing digits. Curation was performed iteratively, beginning from the original BnL splits and involving repeated manual inspection of prediction errors across cross-validation splits. For example, lines labeled as **Fraktur** but consistently predicted as **Antiqua**, or vice versa, were reviewed manually. In some cases these were acceptable borderline instances, but many revealed inconsistencies with the dataset guidelines, which were intended to preserve only clear **Fraktur** / **non-Fraktur** distinctions. Such inconsistent cases were removed.
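
The review step described above, in which lines are consistently predicted as the opposite class, could be sketched roughly as follows. This is not the creators' actual tooling: the function name `flag_for_review`, the dictionary-based input format, and the `min_agreement` threshold are assumptions for illustration. The sketch only selects candidates; the decision to keep or remove a line remains a manual one.

```python
from collections import Counter


def flag_for_review(labels, fold_predictions, min_agreement=1.0):
    """Return ids whose predictions contradict the assigned label.

    labels: dict mapping line id -> assigned label.
    fold_predictions: dict mapping line id -> list of predicted labels,
        one per cross-validation run in which the line was held out.
    min_agreement: share of runs that must agree on the same wrong label.
    """
    flagged = []
    for line_id, preds in fold_predictions.items():
        if not preds:
            continue
        top_label, top_count = Counter(preds).most_common(1)[0]
        if top_label != labels[line_id] and top_count / len(preds) >= min_agreement:
            flagged.append(line_id)
    return sorted(flagged)


# Made-up ids and predictions for illustration.
labels = {"a": "fraktur", "b": "fraktur", "c": "other"}
fold_predictions = {
    "a": ["other", "other", "other"],        # always contradicts its label
    "b": ["fraktur", "other", "fraktur"],    # mostly agrees with its label
    "c": ["other", "other", "other"],        # always agrees with its label
}
print(flag_for_review(labels, fold_predictions))  # ['a']
```

Lowering `min_agreement` widens the candidate list to lines that are mispredicted in most, but not all, runs.
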
The test set also includes **84 additional lines** from impresso corpus material curated by the dataset creators.
---
**Release version:** `v1.0.0`
This release may receive metadata-only updates after publication. The dataset contents associated with this release are frozen under the Git tag `v1.0.0`. Any modification to the dataset contents, including items or labels, triggers a new release version.
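
Because the contents are frozen under the Git tag `v1.0.0`, an evaluation can pin that tag through the `revision` argument of `load_dataset` so that later releases do not silently change the results. The sketch assumes the tag is available on the Hub repository as stated above; the helper name `load_pinned_testset` is hypothetical, and calling it requires network access and the `datasets` package.

```python
DATASET_ID = "impresso-project/frakturline-testset"
REVISION = "v1.0.0"  # Git tag named in this release note


def load_pinned_testset():
    """Load the test split at the frozen release tag."""
    # Imported lazily so the constants above can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="test", revision=REVISION)
```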
---
## License
This dataset contains material from multiple sources with different rights statuses.
- Part of the source material derives from the National Library of Luxembourg (BnL) OCR ground-truth release [`bnl-ground-truth-newspapers-before-1878.zip`](https://data.bnl.lu/wp-content/uploads/2021/07/bnl-ground-truth-newspapers-before-1878.zip), which the BnL describes as **CC0 / Public Domain**.
- Additional original contributions made by the dataset creators in this release — including curation, selection, cleaning decisions, split construction, and documentation — are made available under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Users should therefore distinguish between upstream source material, which remains available under its original terms, and original contributions in this release, which are provided under **CC BY 4.0**.
This dataset is intended primarily for **evaluation use**. Do not train models with it! Users remain responsible for assessing the rights status of any upstream source material, especially where historical newspaper provenance or copyright status is not fully documented across jurisdictions.
If you use this dataset, please cite the [Impresso project](https://impresso-project.ch/) and link to this repository.
Provider:
impresso-project



