
mdnaseif/hafith-synthetic-1m

Hugging Face · Updated 2026-02-19 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mdnaseif/hafith-synthetic-1m
Description:
---
language:
- ar
task_categories:
- image-to-text
tags:
- ocr
- arabic
- manuscripts
- synthetic-data
size_categories:
- 100K<n<1M
license: mit
---

# HAFITH Synthetic Dataset (1M Samples)

Synthetic dataset of 1 million manuscript-style Arabic text line images for training the HAFITH OCR model.

## Dataset Summary

- **Total Samples**: 1,000,000 (900K train / 50K val / 50K test)
- **Text Source**: ArabicText-Large (244M words)
- **Fonts**: 350 Arabic fonts (Naskh, Ruq'ah, Thuluth, Maghrebi)
- **Image Size**: 800×48 to 2500×128 pixels
- **Backgrounds**: 50 aged parchment variants

## Generation Pipeline

1. **Text Rendering**: Sequential extraction from ArabicText-Large with proper RTL shaping
2. **Degradation**: Stochastic augmentations simulating manuscript artifacts (paper texture, ink degradation, aging effects)

### Key Augmentations

| Type | Augmentations | Probability |
|------|--------------|-------------|
| Geometric | Paper texture, baseline warp | 0.50-0.70 |
| Noise | Gaussian noise, blur | 0.25-0.30 |
| Ink | Erosion, feathering, bleed | 0.15-0.25 |
| Aging | Foxing, stains, show-through | 0.10-0.20 |

## Usage

```python
from datasets import load_dataset

# Load the full dataset (downloads all splits)
dataset = load_dataset("mdnaseif/hafith-synthetic-1m")

# Access a sample by index
sample = dataset['train'][0]
print(sample['text'])
sample['image'].show()

# Or stream instead of downloading; streamed splits are iterable
# and cannot be indexed, so take samples with next(iter(...))
streamed = load_dataset("mdnaseif/hafith-synthetic-1m", streaming=True)
first = next(iter(streamed['train']))
print(first['text'])
```

## Why Synthetic Data?

Our ablations show that Aranizer tokenization requires synthetic pretraining:

| Configuration | CER |
|---------------|-----|
| NaFlex + Aranizer (no synthetic) | 8.47% ❌ |
| NaFlex + Aranizer + Synthetic | **5.10%** ✅ |

Without synthetic data, switching to Arabic tokenization *degrades* performance.
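The table above reports character error rate (CER). The card does not say how CER was computed (for example, whether diacritics or whitespace were normalized), so the following is only a minimal sketch of the standard definition: total character-level edit distance divided by total reference length. The function names `levenshtein` and `cer` are illustrative and are not taken from the HAFITH code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hyp` into `ref` (computed over Unicode code points)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # character missing from hyp
                            curr[j - 1] + 1,       # extra character in hyp
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: summed edit distance over summed reference length.
    Assumes no text normalization, which may differ from the authors' setup."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# One reference line of 4 characters, with one character dropped in the prediction
print(f"{cer(['كتاب'], ['كتب']):.2%}")
```

Summing edits and lengths over the whole corpus before dividing weights long lines more heavily than averaging per-line CER would; which convention the reported numbers use is not stated in the card.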

## Citation

```bibtex
@article{naseif2026hafith,
  title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
  author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
  year={2026}
}
```

## Links

- 🤗 **Model**: [mdnaseif/hafith](https://huggingface.co/mdnaseif/hafith)
- 📊 **Real Benchmark**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- 💻 **Code**: [GitHub](https://github.com/mdnaseif/hafith)

## License

MIT License - See LICENSE file for details.
Provider:
mdnaseif