mdnaseif/hafith-synthetic-1m
Source: Hugging Face · Updated: 2026-02-19 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/mdnaseif/hafith-synthetic-1m
Description:
---
language:
- ar
task_categories:
- image-to-text
tags:
- ocr
- arabic
- manuscripts
- synthetic-data
size_categories:
- 100K<n<1M
license: mit
---
# HAFITH Synthetic Dataset (1M Samples)
Synthetic dataset of 1 million manuscript-style Arabic text line images for training the HAFITH OCR model.
## Dataset Summary
- **Total Samples**: 1,000,000 (900K train / 50K val / 50K test)
- **Text Source**: ArabicText-Large (244M words)
- **Fonts**: 350 Arabic fonts (Naskh, Ruq'ah, Thuluth, Maghrebi)
- **Image Size**: 800×48 to 2500×128 pixels
- **Backgrounds**: 50 aged parchment variants
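Because the images vary in both width and height, a loader usually has to normalize them before batching. The following is a minimal sketch of computing an aspect-ratio-preserving target size. The target height of 64 px and the width cap of 2048 px are hypothetical values chosen for illustration, not constants from the HAFITH pipeline.

```python
def target_size(width: int, height: int,
                target_height: int = 64, max_width: int = 2048) -> tuple[int, int]:
    """Return (new_width, new_height) scaled to target_height.

    target_height and max_width are illustrative assumptions,
    not values documented for this dataset.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target_height / height
    new_width = max(1, round(width * scale))
    # Capping the width is the only case where the aspect ratio is distorted.
    return min(new_width, max_width), target_height


# The two extremes listed in the summary above
print(target_size(800, 48))    # (1067, 64)
print(target_size(2500, 128))  # (1250, 64)
```

The returned tuple can be passed to whatever resize routine the training code uses.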
## Generation Pipeline
1. **Text Rendering**: Sequential extraction from ArabicText-Large with proper RTL shaping
2. **Degradation**: Stochastic augmentations simulating manuscript artifacts (paper texture, ink degradation, aging effects)
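Step 1 can be pictured as a single pass over the corpus that cuts it into line-sized chunks. The sketch below assumes whitespace tokenization and a random line length of 5 to 15 words; both are hypothetical choices, since the card does not state the actual values. Shaping and rendering are not shown.

```python
import random
from typing import Iterable, Iterator


def extract_lines(words: Iterable[str], min_words: int = 5,
                  max_words: int = 15, seed: int = 0) -> Iterator[str]:
    """Yield consecutive text lines from a word stream, in corpus order.

    min_words and max_words are illustrative assumptions.
    """
    rng = random.Random(seed)
    buffer: list[str] = []
    target = rng.randint(min_words, max_words)
    for word in words:
        buffer.append(word)
        if len(buffer) == target:
            yield " ".join(buffer)
            buffer = []
            target = rng.randint(min_words, max_words)
    if buffer:
        # The final line may be shorter than min_words.
        yield " ".join(buffer)
```

Reading the corpus sequentially, rather than sampling at random, means no word is used twice and the line order follows the source text.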
### Key Augmentations
| Type | Augmentations | Probability |
|------|--------------|-------------|
| Geometric | Paper texture, baseline warp | 0.50-0.70 |
| Noise | Gaussian noise, blur | 0.25-0.30 |
| Ink | Erosion, feathering, bleed | 0.15-0.25 |
| Aging | Foxing, stains, show-through | 0.10-0.20 |
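A common way to apply a table like this is to let each augmentation fire independently with its own probability. The sketch below illustrates that pattern on a grayscale image stored as nested lists of 0-255 integers. The function names and the toy operations are hypothetical stand-ins rather than the HAFITH implementation, and the two probabilities are simply picked from within the ranges listed above.

```python
import random
from typing import Callable

Image = list[list[int]]  # grayscale rows, values 0-255


def add_gaussian_noise(img: Image, rng: random.Random, sigma: float = 8.0) -> Image:
    """Add zero-mean Gaussian noise and clamp to the valid pixel range."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in img]


def fade_ink(img: Image, rng: random.Random, strength: float = 0.2) -> Image:
    """Toy stand-in for ink degradation: move every pixel toward white."""
    return [[round(p + (255 - p) * strength) for p in row] for row in img]


# (augmentation, probability) pairs; values are assumptions within the table's ranges
PIPELINE: list[tuple[Callable[[Image, random.Random], Image], float]] = [
    (add_gaussian_noise, 0.30),
    (fade_ink, 0.20),
]


def degrade(img: Image, rng: random.Random) -> Image:
    """Apply each augmentation independently with its own probability."""
    for augment, prob in PIPELINE:
        if rng.random() < prob:
            img = augment(img, rng)
    return img


rng = random.Random(42)  # fixed seed so a run is reproducible
line_image = [[0, 128, 255] for _ in range(4)]
degraded = degrade(line_image, rng)
```

Because each draw is independent, a single image can receive several degradations, one, or none.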
## Usage
```python
from datasets import load_dataset

# Load the full dataset (downloads all splits)
dataset = load_dataset("mdnaseif/hafith-synthetic-1m")

# Access a sample by index
sample = dataset["train"][0]
print(sample["text"])
sample["image"].show()

# Or stream to avoid downloading everything up front.
# Streamed splits are iterated, not indexed.
streamed = load_dataset("mdnaseif/hafith-synthetic-1m", streaming=True)
sample = next(iter(streamed["train"]))
```
## Why Synthetic Data?
Our ablations show that Aranizer tokenization requires synthetic pretraining:
| Configuration | CER |
|---------------|-----|
| NaFlex + Aranizer (no synthetic) | 8.47% ❌ |
| NaFlex + Aranizer + Synthetic | **5.10%** ✅ |

Without synthetic data, switching to Arabic tokenization *degrades* performance.
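For reference, character error rate (CER) is conventionally the character-level edit distance between the prediction and the reference, divided by the reference length. A minimal sketch of that standard definition follows. Whether the HAFITH evaluation normalizes text (for example, diacritics or whitespace) before scoring is not stated here, so no normalization is assumed.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One missing character out of four in the reference
print(cer("كتاب", "كتب"))  # 0.25
```

By this measure, the drop from 8.47% to 5.10% in the table is a relative error reduction of roughly 40%.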
## Citation
```bibtex
@article{naseif2026hafith,
title={HAFITH: Aspect-Ratio Preserving Vision-Language Model for Historical Arabic Manuscript Recognition},
author={Naseif, Mohammed and Mesabah, Islam and Hajjaj, Dalia and Hassan, Abdulrahman and Elhayek, Ahmed and Koubaa, Anis},
year={2026}
}
```
## Links
- 🤗 **Model**: [mdnaseif/hafith](https://huggingface.co/mdnaseif/hafith)
- 📊 **Real Benchmark**: [mdnaseif/hafith-combined-benchmark](https://huggingface.co/datasets/mdnaseif/hafith-combined-benchmark)
- 💻 **Code**: [GitHub](https://github.com/mdnaseif/hafith)
## License
MIT License - See LICENSE file for details.
Provider:
mdnaseif



