atlasia/atlasOCR-data

Name: atlasia/atlasOCR-data
Creator: atlasia
Published: 2025-09-16 15:55:11
License: No description provided

Source: Hugging Face · Updated: 2025-09-16 · Indexed: 2026-01-03

Download link:

https://hf-mirror.com/datasets/atlasia/atlasOCR-data

Resource overview:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: image
    dtype: image
  - name: metadata
    struct:
    - name: contains_title
      dtype: bool
    - name: font
      dtype: string
  splits:
  - name: train
    num_bytes: 12777223035.970001
    num_examples: 26162
  - name: validation
    num_bytes: 1892329629.54
    num_examples: 3930
  - name: test
    num_bytes: 56546649
    num_examples: 196
  download_size: 9420060803
  dataset_size: 14726099314.510002
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
language:
- ary
size_categories:
- 10K<n<100K
---

# AtlasOCR Darija Dataset

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65f5c3528fb2b1535728138f/W9oSeX75pjvEH2WgelHR-.png" width=700 height=700/>

![AtlasOCR](https://img.shields.io/badge/Atlas-OCR-black)
![Darija](https://img.shields.io/badge/Language-Darija🇲🇦-red)
![License](https://img.shields.io/badge/License-Apache_2.0-green)
</center>

## Dataset Description

The AtlasOCR Darija Dataset is the first large-scale OCR dataset specifically designed for Moroccan Darija, the Moroccan Arabic dialect. It was created to address the significant lack of specialized OCR tools for Darija, which has been a barrier for developers and organizations working with Moroccan content. The dataset combines both synthetic and real-world data sources to capture the rich diversity of Darija text in various contexts, from social media posts to handwritten notes and printed materials.

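
As a rough illustration of the schema declared in the card metadata (`text`, `image`, and a `metadata` struct with `contains_title` and `font`), the sketch below streams a few records and summarizes them. It assumes the Hugging Face `datasets` library is installed and that the repository ID matches this card; the helper names `summarize_example` and `iter_examples` are hypothetical and not part of the AtlasOCR project.

```python
from typing import Any, Dict, Iterator

DATASET_ID = "atlasia/atlasOCR-data"  # repository ID taken from this card


def summarize_example(example: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one record into a small summary dict.

    Assumes the schema from the card metadata: a `text` string, an `image`,
    and a `metadata` struct (possibly missing) with `contains_title` and `font`.
    """
    metadata = example.get("metadata") or {}
    image = example.get("image")
    return {
        "num_words": len(example["text"].split()),   # whitespace-separated tokens
        "image_size": getattr(image, "size", None),  # PIL images expose (width, height)
        "contains_title": metadata.get("contains_title"),
        "font": metadata.get("font"),
    }


def iter_examples(split: str = "test") -> Iterator[Dict[str, Any]]:
    """Stream records from the Hub instead of downloading the full archive first."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    yield from load_dataset(DATASET_ID, split=split, streaming=True)


# Example usage (needs network access):
# from itertools import islice
# for example in islice(iter_examples("test"), 3):
#     print(summarize_example(example))
```

Streaming is used here only because the card lists a download size of roughly 9.4 GB; a regular `load_dataset` call without `streaming=True` would work equally well once the data is cached locally.
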
## Dataset Structure

Each instance in the dataset contains:

- An image containing Darija text
- Corresponding text transcription
- Metadata (where applicable)

### Data Splits

| Split          | Samples    | Total Words |
|----------------|------------|-------------|
| Train          | 26,162     | 9.5M        |
| Validation     | 3,930      | 1.2M        |
| **Total**      | **30,092** | **10.7M**   |

The card metadata also lists a test split of 196 examples, which is not counted in this table.

### Data Composition

- **Synthetic Data**: 86% of the dataset
- **Real-World Data**: 14% of the dataset

### Source Data

#### Synthetic Data

Synthetic data was generated using [OCRSmith](https://github.com/atlasia-ma/OCRSmith), an open-source toolkit developed specifically for this project. OCRSmith simulates real-world conditions including:

- Various fonts
- Different layouts
- Diverse backgrounds
- Text distortions

This approach allowed for the instant generation of tens of thousands of labeled images complete with bounding boxes and metadata.

#### Real-World Data

Real-world data was carefully curated from multiple sources:

1. **Scanned Books**:
   - "العَرَبِيَّةُ الدَّارِجَةُ" by Mohammed El-Madlaoui El-Mounabhi
   - "علشان الصغيرة والصغير" by Farouk ElMarrakchi
   - Approximately 700 pages of high-quality Darija text
   - Enriched with pseudo-labels generated by Gemini 2.0 Flash
2. **Social Media Images**:
   - Primarily from LinkedIn
   - Poster-style PDFs converted to images
   - Focus on educational material
3. **Educational Documents**:
   - Moroccan driving license exam materials
   - Required careful cropping and preprocessing due to faded or cluttered scans
4. **Cookbooks**:
   - Moroccan recipes written in Darija
   - Decorative elements were cropped out
   - Contrast was enhanced for clarity

### Annotation Process

For scanned books, a two-step pseudo-labeling process was used:

1. Initial text extraction using Gemini 2.0 Flash with a prompt prioritizing human readability
2. Human annotation and correction using Argilla for collaborative editing

## Considerations for Using the Data

### Social Impact of Dataset

The dataset enables:

- Digital preservation of historical Moroccan documents
- Analysis of social media content in Darija
- Improved accessibility for Darija speakers
- Large-scale research on Moroccan content

### Discussion of Biases

The dataset contains a mix of synthetic and real-world data, which may introduce certain biases:

- Synthetic data might not perfectly capture all real-world variations
- Real-world data is sourced from specific domains (books, social media, education, cookbooks)
- The dataset may not fully represent all regional variations of Darija

### Other Known Limitations

- The dataset primarily focuses on printed text, with limited handwritten samples
- The synthetic data, while diverse, may not capture all real-world variations
- The dataset is primarily designed for OCR tasks and may not be suitable for other NLP applications without adaptation

## Citation

```
@misc{atlasocr2025,
  title={AtlasOCR: Open-Source OCR for Moroccan Darija with Vision–Language Models},
  author={Imane Momayiz and Soufiane Ait Elaouad and Abdeljalil Elmajjodi and Haitame Bouanane},
  year={2025},
  howpublished={\url{https://huggingface.co/atlasia/AtlasOCR}},
  organization={AtlasIA}
}
```

### Contributions

For more information about the AtlasOCR project, visit:

- [AtlasOCR BlogPost](https://huggingface.co/blog/imomayiz/atlasocr)
- [AtlasOCR Model](https://huggingface.co/atlasia/AtlasOCR)
- [AtlasOCR Demo](https://huggingface.co/spaces/atlasia/AtlasOCR-demo)
- [AtlasOCR Training Dataset](https://huggingface.co/datasets/atlasia/atlasOCR-data)
- [GitHub Repository](https://github.com/atlasia/AtlasOCR)

Provider:

atlasia
