Download link:

https://modelscope.cn/datasets/Misraj/Misraj-DocOCR

Resource description:

# Misraj-DocOCR: An Arabic Document OCR Benchmark 📄

**Dataset:** [Misraj/Misraj-DocOCR](https://huggingface.co/datasets/Misraj/Misraj-DocOCR)

**Domain:** Arabic Document OCR (text + structure)

**Size:** 400 expertly verified pages (real + synthetic)

**Use cases:** OCR, Document Understanding, Markdown/HTML structure preservation

**Status:** Public 🤝

## ✨ Overview

**Misraj-DocOCR** is a curated, expert-verified benchmark for **Arabic document OCR** with an emphasis on **structure preservation** (Markdown/HTML tables, lists, footnotes, math, watermarks, multi-column, marginalia, etc.). Each page includes high-quality ground truth designed to evaluate both **text fidelity** and **layout/structure fidelity**.

- **Diverse content:** books, reports, forms, scholarly pages, and complex layouts.
- **Expert-verified ground truth:** human-reviewed for text **and** structure.
- **Open & reproducible:** intended for fair comparisons and reliable benchmarking.

---

## 📦 Data format

Each example typically includes:

- `uuid`: id of sample
- `image`: page image (PIL-compatible)
- `markdown`: target transcription with structure

### 🔌 Loading

```python
from datasets import load_dataset

ds = load_dataset("Misraj/Misraj-DocOCR")
split = ds["train"]  # or another available split
ex = split[0]

img = ex["image"]  # PIL.Image
gt = ex.get("markdown") or ex.get("text")
print(gt[:400])
# img.show()  # uncomment in a local environment
```

---

## 🧪 Metrics

We report both **text** and **structure** metrics:

* **Text:** WER ↓, CER ↓, BLEU ↑, ChrF ↑
* **Structure:** **TEDS ↑**, **MARS ↑** (Markdown/HTML structure fidelity)

---

## 🏆 Leaderboard (Misraj-DocOCR)

Best values are **bold**, second-best are underlined.

| Model                                  |    WER ↓ |    CER ↓ |    BLEU ↑ |    ChrF ↑ | TEDS ↑ |     MARS ↑ |
| -------------------------------------- | -------: | -------: | --------: | --------: | -----: | ---------: |
| **Baseer (ours)**                      | **0.25** |     0.53 |     76.18 |     87.77 | **66** | **76.885** |
| Gemini-2.5-pro                         |     0.37 |     0.31 | **77.92** | **89.55** |     52 |     70.775 |
| Azure AI Document Intelligence[^azure] |     0.44 | **0.27** |     62.04 |     82.49 |     42 |     62.245 |
| Dots.ocr                               |     0.50 |     0.40 |     58.16 |     78.41 |     40 |     59.205 |
| Nanonets                               |     0.71 |     0.55 |     42.22 |     67.89 |     37 |     52.445 |
| Qari                                   |     0.76 |     0.64 |     38.59 |     64.50 |     21 |     42.750 |
| Qwen2.5-VL-32B                         |     0.76 |     0.59 |     37.62 |     62.64 |     41 |     51.820 |
| GPT-5                                  |     0.86 |     0.62 |     40.67 |      61.6 |     48 |       54.8 |
| Qwen2.5-VL-3B-Instruct                 |     0.87 |     0.71 |     25.39 |     53.42 |     27 |     40.210 |
| Qwen2.5-VL-7B                          |     0.92 |     0.77 |     31.57 |     54.70 |     27 |     40.850 |
| Gemma3-12B                             |     0.96 |     0.80 |     19.75 |     44.53 |     33 |     38.765 |
| Gemma3-4B                              |     1.01 |     0.85 |      9.57 |     31.39 |     28 |     29.695 |
| GPT-4o-mini                            |     1.36 |     1.10 |     22.63 |     47.04 |     26 |      36.52 |
| AIN                                    |     1.23 |     1.11 |      1.25 |      2.24 |     21 |     11.620 |
| Aya-vision                             |     1.41 |     1.07 |      2.91 |      9.81 |     26 |     17.905 |

**Highlights:**

* **Baseer (ours)** leads on **WER**, **TEDS**, and **MARS** → strong text & structure fidelity.
* **Gemini-2.5-pro** tops **BLEU/ChrF**; **Azure AI Document Intelligence** attains lowest **CER**.

---

## 📚 How to cite

If you use **Misraj-DocOCR**, please cite:

```bibtex
@misc{hennara2025baseervisionlanguagemodelarabic,
  title={Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR},
  author={Khalil Hennara and Muhammad Hreden and Mohamed Motasim Hamed and Ahmad Bastati and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan},
  year={2025},
  eprint={2509.18174},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.18174},
}
```
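For readers unfamiliar with the text metrics in the leaderboard, here is a minimal sketch of WER and CER as they are conventionally defined: Levenshtein edit distance divided by the reference length, counted in words or characters. This is not the benchmark's own scoring code. The function names are hypothetical, and any text normalization applied before scoring (whitespace, diacritics, Markdown stripping) is not described in the card and is omitted here. The last helper records an observation from the table rather than a stated definition: every MARS value listed equals the arithmetic mean of that row's ChrF and TEDS values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def mean_of_chrf_and_teds(chrf, teds):
    """Assumption inferred from the leaderboard rows, not a documented formula."""
    return (chrf + teds) / 2


if __name__ == "__main__":
    print(wer("a b c d", "a x c d"))  # one substitution out of four words
    print(wer("a b", "a b c d e"))    # three insertions against two reference words
    print(cer("abcd", "abxd"))
```

Because the denominator is the reference length, a hypothesis much longer than the reference can score above 1, which is consistent with leaderboard entries such as 1.01 or 1.41.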


Use cases: