
mr3vial/paleo-hebrew-seals-synthetic

Source: Hugging Face. Updated 2026-04-08. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mr3vial/paleo-hebrew-seals-synthetic
Description:
---
pretty_name: PaleoHebrew-Seals Synthetic Corpus
license: cc-by-4.0
language:
- he
task_categories:
- object-detection
- image-classification
- text-generation
tags:
- paleo-hebrew
- epigraphy
- ocr
- synthetic-data
- cultural-heritage
- multimodal
size_categories:
- 100K<n<1M
---

# PaleoHebrew-Seals Synthetic Corpus

This repository hosts the **synthetic corpus** part of **PaleoHebrew-Seals**, a dataset suite for multimodal recognition of Paleo-Hebrew seal inscriptions.

## Why this dataset is needed

Annotated real Paleo-Hebrew seal photographs are scarce. The synthetic corpus is designed to provide large-scale supervision for training and augmentation while preserving explicit structure at the character level.

## Overview

The corpus contains **200,000** synthetic images generated with a two-stage pipeline.

### Stage A: structurally supervised generation

Stage A produces clean Paleo-Hebrew renderings with exact supervision. A document-aware composer samples between two modes:

- seal-like inscriptions generated from epigraphic templates with lexicon-based slot filling
- plain-script snippets sampled from Hebrew text resources to diversify local letter contexts

Text is normalized to a canonical **22-letter** inventory for structural generation. Stage A outputs exact character-level boxes together with synchronized text variants.

### Stage B: style adaptation

Stage B adapts Stage A outputs toward more realistic seal-like imagery while preserving the original text and box supervision. In the released pipeline, structural layout is preserved through diffusion-based conditioning, while surface appearance is adapted toward seal-like texture and lighting.
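Because both stages are described as keeping text and character-level boxes in sync, a consumer may want to sanity-check that property on each record before training. The following is a minimal sketch, not the authors' tooling: it uses the card's field names `chars` and `bboxes`, but the box layout `(x_min, y_min, x_max, y_max)` in pixels, the record being a plain dict, and the helper name `check_alignment` are all assumptions to verify against the actual data.

```python
def check_alignment(record, width, height):
    """Return a list of problems found in one record (empty list means OK).

    Assumptions (not confirmed by the dataset card):
    - record["chars"] is a sequence with one entry per character
    - record["bboxes"] holds one (x_min, y_min, x_max, y_max) box per character
    """
    chars = record["chars"]
    bboxes = record["bboxes"]
    problems = []

    # Every character should have exactly one box.
    if len(chars) != len(bboxes):
        problems.append(f"{len(chars)} chars but {len(bboxes)} boxes")

    # Every box should be non-degenerate and lie inside the image.
    for i, (x0, y0, x1, y1) in enumerate(bboxes):
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            problems.append(f"box {i} out of bounds or degenerate")

    return problems


# Hypothetical record, for illustration only.
example = {"chars": ["a", "b"], "bboxes": [(2, 2, 10, 12), (12, 2, 20, 12)]}
issues = check_alignment(example, width=64, height=32)
```

An empty result means the record passed both checks; any other result lists what to inspect.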

## What the corpus contains

Representative supervision includes:

- synthetic images
- character sequences
- character-level bounding boxes
- synchronized text variants
- document kind / source information
- font information
- rendering parameters
- generation metadata

Representative fields include:

- `image`
- `chars`
- `bboxes`
- `text_raw`
- `text_norm`
- `text_gt`
- `text_render`
- document kind / source specification
- font metadata
- sampled rendering controls

## Intended use

This repository is intended primarily as a **training and augmentation resource** for:

- character localization
- character classification
- structured post-OCR
- Hebrew transcription
- synthetic-to-real transfer

Evaluation on real seal photographs should be carried out on the companion real benchmark.

## Relationship to the real benchmark

The companion real benchmark is released separately as:

- `mr3vial/paleo-hebrew-seals-unambiguous`

That benchmark contains:

- **307** real seal images
- selected from **350** initial candidates
- split into **157** training and **150** validation examples

## Split policy and leakage control

The synthetic corpus is intended as a training resource. Any real images used for style adaptation are treated as training-only resources and are kept disjoint from benchmark evaluation artifacts at the **seal-entry level**.

## Limitations

This synthetic corpus reflects concrete design decisions about templates, lexicons, normalization, rendering, and stylization. Models trained heavily on this data may inherit biases toward canonicalized forms, formulaic expressions, or the visual priors of the style-adaptation pipeline.

In particular, structural generation uses a canonical **22-letter** inventory. Downstream users should keep this normalization in mind when studying generalization to more varied epigraphic settings.
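
To make the effect of a 22-letter normalization concrete, the sketch below shows one plausible way to reduce modern Hebrew text to 22 base letters: strip vowel points and other combining marks, fold the five word-final letter forms into their base letters, and drop everything else except spaces. The card does not spell out the corpus's actual normalization rules, so this is an assumed illustration rather than the released pipeline; `normalize_to_22`, `FINAL_TO_BASE`, and `CANONICAL` are hypothetical names.

```python
import unicodedata

# Word-final letter forms mapped to their base letters
# (final kaf, mem, nun, pe, tsadi). Assumed folding rule.
FINAL_TO_BASE = {
    "\u05da": "\u05db",
    "\u05dd": "\u05de",
    "\u05df": "\u05e0",
    "\u05e3": "\u05e4",
    "\u05e5": "\u05e6",
}

# Hebrew letters U+05D0..U+05EA minus the five final forms: 22 letters.
CANONICAL = frozenset(chr(cp) for cp in range(0x05D0, 0x05EB)) - frozenset(FINAL_TO_BASE)


def normalize_to_22(text):
    """Reduce text to the 22 base Hebrew letters plus spaces."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        # Skip combining marks such as vowel points and cantillation.
        if unicodedata.category(ch) == "Mn":
            continue
        ch = FINAL_TO_BASE.get(ch, ch)
        if ch in CANONICAL or ch == " ":
            out.append(ch)
    return "".join(out)


# Pointed "shalom" (shin, qamats, shin dot, lamed, vav, holam, final mem).
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
normalized = normalize_to_22(pointed)
```

Under these assumed rules the pointed word collapses to four unpointed base letters, with the final mem replaced by regular mem. This is the kind of information loss to keep in mind when comparing model output against transcriptions that retain final forms or diacritics.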

## Companion resources

- Real benchmark: `mr3vial/paleo-hebrew-seals-unambiguous`
- Demo Space: `https://mr3vial-paleo-hebrew-project.hf.space/`
- Demo video: `https://drive.google.com/file/d/1susDDbaZyFny1Ga9bZXyEVibD4R8YyrW/view`

## Access

This repository is intended to be publicly accessible **without login and without access requests**.

## License

The dataset contents in this repository are released under **CC BY 4.0**. Companion code, evaluation scripts, and model checkpoints may be documented and licensed separately in their respective repositories.

## Citation

If you use this resource, please cite the dataset paper as follows while the submission is under review:

```bibtex
@misc{gorbulev2026paleohebrewseals,
  title={PaleoHebrew-Seals: A Real-and-Synthetic Dataset Suite for Multimodal Recognition of Paleo-Hebrew Seal Inscriptions},
  author={Gorbulev, Alex and Humonen, Innokentiy and Golyadkin, Maksim and Makarov, Ilya},
  year={2026},
  note={Under review}
}
```

## Contact

For questions about the synthetic release, please contact the repository maintainers.

Provider:
mr3vial