mr3vial/paleo-hebrew-seals-synthetic
Hugging Face | Updated 2026-04-08 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mr3vial/paleo-hebrew-seals-synthetic
Resource description:
---
pretty_name: PaleoHebrew-Seals Synthetic Corpus
license: cc-by-4.0
language:
- he
task_categories:
- object-detection
- image-classification
- text-generation
tags:
- paleo-hebrew
- epigraphy
- ocr
- synthetic-data
- cultural-heritage
- multimodal
size_categories:
- 100K<n<1M
---
# PaleoHebrew-Seals Synthetic Corpus
This repository hosts the **synthetic corpus** part of **PaleoHebrew-Seals**, a dataset suite for multimodal recognition of Paleo-Hebrew seal inscriptions.
## Why this dataset is needed
Annotated real Paleo-Hebrew seal photographs are scarce. The synthetic corpus is designed to provide large-scale supervision for training and augmentation while preserving explicit structure at the character level.
## Overview
The corpus contains **200,000** synthetic images generated with a two-stage pipeline.
### Stage A: structurally supervised generation
Stage A produces clean Paleo-Hebrew renderings with exact supervision. A document-aware composer samples between two modes:
- seal-like inscriptions generated from epigraphic templates with lexicon-based slot filling
- plain-script snippets sampled from Hebrew text resources to diversify local letter contexts
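The two composer modes can be pictured with a minimal sketch. Everything in it is a hypothetical stand-in: the templates, the lexicon entries, the plain snippets, the `SEAL_MODE_PROB` mixing weight, and the function names are illustrative assumptions, not the resources or code of the released pipeline.

```python
import random

# All values below are hypothetical placeholders, not the released resources.
TEMPLATES = ["ל{name} בן {patronym}", "ל{name} עבד {patron}"]
LEXICON = {
    "name": ["שמע", "חנן"],
    "patronym": ["חלקיהו", "נריהו"],
    "patron": ["ירבעם", "המלך"],
}
PLAIN_SNIPPETS = ["בראשית ברא", "שמע ישראל"]
SEAL_MODE_PROB = 0.5  # assumed mixing weight between the two modes


def fill_template(template, lexicon, rng):
    """Fill each {slot} in the template with a random lexicon entry."""
    # Draw one entry per slot; str.format ignores slots the template does not use.
    choices = {slot: rng.choice(words) for slot, words in lexicon.items()}
    return template.format(**choices)


def compose(rng):
    """Sample one document: a seal-like inscription or a plain-script snippet."""
    if rng.random() < SEAL_MODE_PROB:
        text = fill_template(rng.choice(TEMPLATES), LEXICON, rng)
        return {"kind": "seal", "text": text}
    return {"kind": "plain", "text": rng.choice(PLAIN_SNIPPETS)}


rng = random.Random(0)  # fixed seed so the sample is reproducible
samples = [compose(rng) for _ in range(5)]
for sample in samples:
    print(sample["kind"], sample["text"])
```

Recording the sampled mode alongside the text, as the `kind` key does here, mirrors the document kind / source information that the corpus reports for each example.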
Text is normalized to a canonical **22-letter** inventory for structural generation. Stage A outputs exact character-level boxes together with synchronized text variants.
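As a rough illustration of what such a normalization could look like, the sketch below strips combining marks (vowel points and cantillation), folds the five final letter forms into their base letters, and drops anything outside the 22 base letters and spaces. These rules and the name `normalize_22` are assumptions made for illustration; the exact normalization used by the pipeline may differ.

```python
import unicodedata

# Final letter forms mapped to their base letters (kaf, mem, nun, pe, tsadi).
FINAL_TO_BASE = {
    "\u05da": "\u05db",
    "\u05dd": "\u05de",
    "\u05df": "\u05e0",
    "\u05e3": "\u05e4",
    "\u05e5": "\u05e6",
}

# The Unicode block U+05D0..U+05EA holds 27 letters; removing the 5 finals leaves 22.
BASE_LETTERS = [
    chr(cp) for cp in range(0x05D0, 0x05EB) if chr(cp) not in FINAL_TO_BASE
]
BASE_SET = set(BASE_LETTERS)


def normalize_22(text):
    """Reduce Hebrew text to the 22 base letters plus single spaces (assumed rules)."""
    # NFD splits pointed letters into a base letter followed by combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # skip vowel points and cantillation marks
        ch = FINAL_TO_BASE.get(ch, ch)
        if ch in BASE_SET or ch == " ":
            kept.append(ch)  # everything else (digits, Latin, punctuation) is dropped
    # Collapse runs of spaces left behind by dropped characters.
    return " ".join("".join(kept).split())


print(len(BASE_LETTERS))
print(normalize_22("\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"))
```

Because final forms are folded, a normalized string cannot be mapped back to modern orthography without extra information, which is one reason to keep several synchronized text variants per example.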
### Stage B: style adaptation
Stage B adapts Stage A outputs toward more realistic seal-like imagery while preserving the original text and box supervision. In the released pipeline, structural layout is preserved through diffusion-based conditioning, while surface appearance is adapted toward seal-like texture and lighting.
## What the corpus contains
Representative supervision includes:
- synthetic images
- character sequences
- character-level bounding boxes
- synchronized text variants
- document kind / source information
- font information
- rendering parameters
- generation metadata
Representative fields include:
- `image`
- `chars`
- `bboxes`
- `text_raw`
- `text_norm`
- `text_gt`
- `text_render`
- document kind / source specification
- font metadata
- sampled rendering controls
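Before training on these fields, it can help to confirm that the character sequence and the boxes of a record line up. The sketch below is a minimal consistency check under two assumptions that the card does not state: that `chars` and `bboxes` are parallel lists, and that each box is `[x_min, y_min, x_max, y_max]` in pixels. The sample record and the helper `check_record` are made up for illustration.

```python
def check_record(record, image_size):
    """Return a list of problems found in one record (empty if it looks consistent)."""
    width, height = image_size
    problems = []
    chars, bboxes = record["chars"], record["bboxes"]
    if len(chars) != len(bboxes):
        problems.append(f"{len(chars)} chars but {len(bboxes)} boxes")
    for i, (x0, y0, x1, y1) in enumerate(bboxes):
        # Assumed box layout: [x_min, y_min, x_max, y_max] in pixel coordinates.
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            problems.append(f"box {i} is degenerate or outside the image")
    return problems


# Hypothetical record with made-up values; real records may be shaped differently.
record = {
    "chars": ["ל", "ש", "מ", "ע"],
    "bboxes": [[86, 12, 108, 40], [60, 12, 82, 40], [34, 12, 56, 40], [10, 12, 30, 40]],
    "text_norm": "לשמע",
}
print(check_record(record, image_size=(128, 64)))
```

If the real box layout turns out to be `[x, y, width, height]` or normalized coordinates, only the bounds test inside the loop needs to change.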
## Intended use
This repository is intended primarily as a **training and augmentation resource** for:
- character localization
- character classification
- structured post-OCR
- Hebrew transcription
- synthetic-to-real transfer
Evaluation on real seal photographs should be carried out on the companion real benchmark.
## Relationship to the real benchmark
The companion real benchmark is released separately as:
- `mr3vial/paleo-hebrew-seals-unambiguous`
That benchmark contains:
- **307** real seal images
- selected from **350** initial candidates
- split into **157** training and **150** validation examples
## Split policy and leakage control
The synthetic corpus is intended as a training resource. Any real images used for style adaptation are treated as training-only resources and are kept disjoint from benchmark evaluation artifacts at the **seal-entry level**.
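Checking disjointness at the seal-entry level rather than the image level matters when one seal has several photographs. The sketch below shows such a check under the assumption that every image can be mapped to a seal-entry identifier; the file names, the identifiers, and the helper `find_leaked_entries` are hypothetical.

```python
def find_leaked_entries(style_entries, eval_entries):
    """Return seal-entry IDs present in both collections (should be empty)."""
    return sorted(set(style_entries) & set(eval_entries))


# Hypothetical image -> seal-entry mappings; the names are illustrative only.
style_images = {
    "seal_0001_a.jpg": "seal_0001",
    "seal_0001_b.jpg": "seal_0001",
    "seal_0002_a.jpg": "seal_0002",
}
eval_images = {
    "seal_0003_a.jpg": "seal_0003",
    "seal_0001_c.jpg": "seal_0001",
}

# No file name is shared, so an image-level comparison finds nothing.
shared_files = sorted(set(style_images) & set(eval_images))
# The entry-level comparison flags the seal that appears on both sides.
leaked = find_leaked_entries(style_images.values(), eval_images.values())
print(shared_files, leaked)
```

In this made-up example the two sides share no file, yet one seal entry appears in both, which is the kind of overlap an entry-level policy is meant to rule out.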
## Limitations
This synthetic corpus reflects concrete design decisions about templates, lexicons, normalization, rendering, and stylization. Models trained heavily on this data may inherit biases toward canonicalized forms, formulaic expressions, or the visual priors of the style-adaptation pipeline.
In particular, structural generation uses a canonical **22-letter** inventory. Downstream users should keep this normalization in mind when studying generalization to more varied epigraphic settings.
## Companion resources
- Real benchmark: `mr3vial/paleo-hebrew-seals-unambiguous`
- Demo Space: `https://mr3vial-paleo-hebrew-project.hf.space/`
- Demo video: `https://drive.google.com/file/d/1susDDbaZyFny1Ga9bZXyEVibD4R8YyrW/view`
## Access
This repository is intended to be publicly accessible **without login and without access requests**.
## License
The dataset contents in this repository are released under **CC BY 4.0**.
Companion code, evaluation scripts, and model checkpoints may be documented and licensed separately in their respective repositories.
## Citation
If you use this resource, please cite the dataset paper as follows while the submission is under review:
```bibtex
@misc{gorbulev2026paleohebrewseals,
title={PaleoHebrew-Seals: A Real-and-Synthetic Dataset Suite for Multimodal Recognition of Paleo-Hebrew Seal Inscriptions},
author={Gorbulev, Alex and Humonen, Innokentiy and Golyadkin, Maksim and Makarov, Ilya},
year={2026},
note={Under review}
}
```
## Contact
For questions about the synthetic release, please contact the repository maintainers.
Provider:
mr3vial



