atlasia/atlasOCR-data
Source: Hugging Face (updated 2025-09-16, indexed 2026-01-03)
Download link:
https://hf-mirror.com/datasets/atlasia/atlasOCR-data
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: metadata
    struct:
    - name: contains_title
      dtype: bool
    - name: font
      dtype: string
  splits:
  - name: train
    num_bytes: 12777223035.970001
    num_examples: 26162
  - name: validation
    num_bytes: 1892329629.54
    num_examples: 3930
  - name: test
    num_bytes: 56546649
    num_examples: 196
  download_size: 9420060803
  dataset_size: 14726099314.510002
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
language:
- ary
size_categories:
- 10K<n<100K
---
# AtlasOCR Darija Dataset
<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65f5c3528fb2b1535728138f/W9oSeX75pjvEH2WgelHR-.png" width=700 height=700/>



</center>
## Dataset Description
The AtlasOCR Darija Dataset is the first large-scale OCR dataset specifically designed for Moroccan Darija, the Moroccan Arabic dialect. It was created to address the significant lack of specialized OCR tools for Darija, which has been a barrier for developers and organizations working with Moroccan content.
The dataset combines both synthetic and real-world data sources to capture the rich diversity of Darija text in various contexts, from social media posts to handwritten notes and printed materials.
## Dataset Structure
Each instance in the dataset contains:
- An image containing Darija text
- Corresponding text transcription
- Metadata (where applicable)
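
As a minimal sketch of how a record could be inspected, the snippet below assumes the field names from the dataset metadata header (`text`, `image`, and a `metadata` struct with `font` and `contains_title`) and uses the `load_dataset` function from the Hugging Face `datasets` library. The helper names `describe_record` and `load_split` are illustrative and not part of the AtlasOCR project.

```python
from typing import Any, Dict


def describe_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize one record without touching the image pixels."""
    # Metadata is only present "where applicable", so tolerate None.
    metadata = record.get("metadata") or {}
    text = record.get("text") or ""
    return {
        "num_words": len(text.split()),
        "has_image": record.get("image") is not None,
        "font": metadata.get("font"),
        "contains_title": metadata.get("contains_title"),
    }


def load_split(split: str = "validation"):
    """Load one split from the Hub (needs the `datasets` package and network access)."""
    # Imported lazily so describe_record stays dependency-free.
    from datasets import load_dataset

    return load_dataset("atlasia/atlasOCR-data", split=split)


# Usage (downloads data, so it is left commented out):
# ds = load_split("validation")
# print(describe_record(ds[0]))
```

Whitespace splitting is only a rough word count for Darija text; it is used here to keep the sketch dependency-free.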
### Data Splits
| Split | Samples | Total Words |
|-------|---------|-------------|
| Train | 26,162 | 9.5M |
| Validation | 3,930 | 1.2M |
| Test | 196 | not reported |
| **Total (train + validation)** | **30,092** | **10.7M** |
### Data Composition
- **Synthetic Data**: 86% of the dataset
- **Real-World Data**: 14% of the dataset
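
The figures above can be turned into split shares and rough per-source counts. This is a small arithmetic sketch: the sample counts come from the Data Splits table, and it assumes (the card does not say) that the 86% / 14% composition refers to samples rather than words, so the per-source counts are estimates only.

```python
# Sample counts from the Data Splits table.
SPLIT_SAMPLES = {"train": 26_162, "validation": 3_930}

# From Data Composition; ASSUMED here to be a share of samples, not words.
SYNTHETIC_SHARE = 0.86

total = sum(SPLIT_SAMPLES.values())
split_share = {name: count / total for name, count in SPLIT_SAMPLES.items()}

# Rounded estimates, not official figures.
synthetic_estimate = round(SYNTHETIC_SHARE * total)
real_estimate = total - synthetic_estimate

for name, share in split_share.items():
    print(f"{name}: {share:.1%} of {total} samples")
print(f"synthetic ~{synthetic_estimate}, real-world ~{real_estimate} (estimates)")
```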
### Source Data
#### Synthetic Data
Synthetic data was generated using [OCRSmith](https://github.com/atlasia-ma/OCRSmith), an open-source toolkit developed specifically for this project. OCRSmith simulates real-world conditions including:
- Various fonts
- Different layouts
- Diverse backgrounds
- Text distortions
This approach made it possible to generate tens of thousands of labeled images quickly, each complete with bounding boxes and metadata.
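
To make the idea concrete, here is a hedged sketch of the parameter-sampling step such a generator might perform before rendering. It is not OCRSmith's API: the option pools, the `RenderConfig` class, and `sample_configs` are all hypothetical, and the actual rendering of text onto an image is left out. The `font` and `contains_title` fields mirror the metadata fields exposed by this dataset.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical option pools; replace with the fonts, layouts, and effects you have.
FONTS = ["font_a.ttf", "font_b.ttf", "font_c.ttf"]
LAYOUTS = ["single_column", "two_columns", "title_and_body"]
BACKGROUNDS = ["plain_white", "paper_texture", "colored"]
DISTORTIONS = ["none", "blur", "rotation", "noise"]


@dataclass
class RenderConfig:
    """One synthetic sample: the label text plus how it should be rendered."""
    text: str
    font: str
    layout: str
    background: str
    distortion: str
    contains_title: bool


def sample_configs(texts: List[str], seed: int = 0) -> List[RenderConfig]:
    """Draw one random rendering configuration per text, reproducibly."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    configs = []
    for text in texts:
        layout = rng.choice(LAYOUTS)
        configs.append(
            RenderConfig(
                text=text,
                font=rng.choice(FONTS),
                layout=layout,
                background=rng.choice(BACKGROUNDS),
                distortion=rng.choice(DISTORTIONS),
                contains_title=(layout == "title_and_body"),
            )
        )
    return configs
```

Because the label is the input text itself, every rendered image comes with an exact transcription, which is what makes synthetic generation cheap compared with manual annotation.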
#### Real-World Data
Real-world data was carefully curated from multiple sources:
1. **Scanned Books**:
   - "العَرَبِيَّةُ الدَّارِجَةُ" by Mohammed El-Madlaoui El-Mounabhi
   - "علشان الصغيرة والصغير" by Farouk ElMarrakchi
   - Approximately 700 pages of high-quality Darija text
   - Enriched with pseudo-labels generated by Gemini 2.0 Flash
2. **Social Media Images**:
   - Primarily from LinkedIn
   - Poster-style PDFs converted to images
   - Focus on educational material
3. **Educational Documents**:
   - Moroccan driving license exam materials
   - Required careful cropping and preprocessing due to faded or cluttered scans
4. **Cookbooks**:
   - Moroccan recipes written in Darija
   - Decorative elements were cropped out
   - Contrast was enhanced for clarity
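
The cropping and contrast steps mentioned above could look roughly like the following. The card does not say which tools were used; this sketch assumes the Pillow library, and the function name `preprocess_scan` and the crop coordinates are illustrative.

```python
from typing import Tuple

from PIL import Image, ImageOps


def preprocess_scan(image: Image.Image, crop_box: Tuple[int, int, int, int]) -> Image.Image:
    """Crop away borders or decorations, then stretch contrast.

    crop_box is (left, upper, right, lower) in pixels.
    """
    cropped = image.crop(crop_box)
    gray = ImageOps.grayscale(cropped)
    # autocontrast remaps the darkest pixel to 0 and the lightest to 255.
    return ImageOps.autocontrast(gray)


# Usage with a hypothetical file name:
# page = Image.open("scan_page_001.png")
# clean = preprocess_scan(page, (50, 80, 1150, 1600))
# clean.save("scan_page_001_clean.png")
```

For faded scans, passing a small `cutoff` to `ImageOps.autocontrast` ignores outlier pixels when computing the stretch, which may work better than the default.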
### Annotation Process
For scanned books, a two-step pseudo-labeling process was used:
1. Initial text extraction using Gemini 2.0 Flash with a prompt prioritizing human readability
2. Human annotation and correction using Argilla for collaborative editing
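
The two steps can be sketched as a small data structure that keeps the model output and the human correction side by side. This is an illustration only: it does not call Gemini or Argilla, the `model` argument is any callable that maps an image to text, and the names `PageAnnotation` and `pseudo_label_page` are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional


@dataclass
class PageAnnotation:
    page_id: str
    pseudo_label: str                # step 1: model-generated transcription
    corrected: Optional[str] = None  # step 2: human-reviewed transcription

    @property
    def final_text(self) -> str:
        """Prefer the human correction; fall back to the pseudo-label."""
        return self.corrected if self.corrected is not None else self.pseudo_label

    @property
    def agreement(self) -> Optional[float]:
        """Character-level similarity between model and human text (0 to 1)."""
        if self.corrected is None:
            return None
        return SequenceMatcher(None, self.pseudo_label, self.corrected).ratio()


def pseudo_label_page(page_id: str, image: object,
                      model: Callable[[object], str]) -> PageAnnotation:
    """Run step 1 and return a record that is ready for human review."""
    return PageAnnotation(page_id=page_id, pseudo_label=model(image))
```

Tracking `agreement` per page gives a rough signal of how much the reviewers had to change, which can help prioritize pages for a second look.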
## Considerations for Using the Data
### Social Impact of Dataset
The dataset enables:
- Digital preservation of historical Moroccan documents
- Analysis of social media content in Darija
- Improved accessibility for Darija speakers
- Large-scale research on Moroccan content
### Discussion of Biases
The dataset contains a mix of synthetic and real-world data, which may introduce certain biases:
- Synthetic data might not perfectly capture all real-world variations
- Real-world data is sourced from specific domains (books, social media, education, cookbooks)
- The dataset may not fully represent all regional variations of Darija
### Other Known Limitations
- The dataset primarily focuses on printed text, with limited handwritten samples
- The synthetic data, while diverse, may not capture all real-world variations
- The dataset is primarily designed for OCR tasks and may not be suitable for other NLP applications without adaptation
## Citation
```
@misc{atlasocr2025,
  title={AtlasOCR: Open-Source OCR for Moroccan Darija with Vision–Language Models},
  author={Imane Momayiz and Soufiane Ait Elaouad and Abdeljalil Elmajjodi and Haitame Bouanane},
  year={2025},
  howpublished={\url{https://huggingface.co/atlasia/AtlasOCR}},
  organization={AtlasIA}
}
```
### Contributions
For more information about the AtlasOCR project, visit:
- [AtlasOCR BlogPost](https://huggingface.co/blog/imomayiz/atlasocr)
- [AtlasOCR Model](https://huggingface.co/atlasia/AtlasOCR)
- [AtlasOCR Demo](https://huggingface.co/spaces/atlasia/AtlasOCR-demo)
- [AtlasOCR Training Dataset](https://huggingface.co/datasets/atlasia/atlasOCR-data)
- [GitHub Repository](https://github.com/atlasia/AtlasOCR)
Provider: atlasia



