Misraj-DocOCR
Source: ModelScope community; updated 2025-12-05; indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/Misraj/Misraj-DocOCR
Description:
# Misraj-DocOCR: An Arabic Document OCR Benchmark📄
**Dataset:** [Misraj/Misraj-DocOCR](https://huggingface.co/datasets/Misraj/Misraj-DocOCR)
**Domain:** Arabic Document OCR (text + structure)
**Size:** 400 expertly verified pages (real + synthetic)
**Use cases:** OCR, Document Understanding, Markdown/HTML structure preservation
**Status:** Public 🤝
## ✨ Overview
**Misraj-DocOCR** is a curated, expert-verified benchmark for **Arabic document OCR** with an emphasis on **structure preservation** (Markdown/HTML tables, lists, footnotes, math, watermarks, multi-column, marginalia, etc.). Each page includes high-quality ground truth designed to evaluate both **text fidelity** and **layout/structure fidelity**.
- **Diverse content:** books, reports, forms, scholarly pages, and complex layouts.
- **Expert-verified ground truth:** human-reviewed for text **and** structure.
- **Open & reproducible:** intended for fair comparisons and reliable benchmarking.
---
## 📦 Data format
Each example typically includes:
- `uuid`: unique identifier of the sample
- `image`: page image (PIL-compatible)
- `markdown`: target transcription with structure
### 🔌 Loading
```python
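from datasets import load_dataset

# Download (or reuse the cached copy of) the benchmark from the Hugging Face Hub.
ds = load_dataset("Misraj/Misraj-DocOCR")
split = ds["train"]  # or another available split
ex = split[0]
img = ex["image"]  # PIL.Image
# Ground truth is expected under "markdown"; fall back to "text", then to an empty string.
gt = ex.get("markdown") or ex.get("text") or ""
print(gt[:400])
# img.show()  # uncomment in a local environment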
```
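
For manual inspection it can help to write a page and its ground truth side by side on disk. The sketch below assumes the `uuid`, `image`, and `markdown` fields listed above and falls back to `text` in the same way as the loading snippet. The helper name `export_sample` and the output layout are hypothetical and not part of the dataset or of the `datasets` library. A call such as `export_sample(split[0], "samples")` would be one way to use it.

```python
from pathlib import Path


def export_sample(example, out_dir):
    """Write one sample's page image and Markdown ground truth to out_dir.

    `example` is assumed to be a dict-like record with `uuid`, `image`
    (an object with a PIL-style `save` method) and `markdown` fields.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    stem = str(example["uuid"])
    # Same fallback as the loading snippet: prefer "markdown", then "text".
    ground_truth = example.get("markdown") or example.get("text") or ""

    md_path = out / f"{stem}.md"
    md_path.write_text(ground_truth, encoding="utf-8")

    img_path = out / f"{stem}.png"
    example["image"].save(img_path)  # PIL.Image.save infers PNG from the suffix

    return img_path, md_path
```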
---
## 🧪 Metrics
We report both **text** and **structure** metrics:
* **Text:** WER ↓, CER ↓, BLEU ↑, ChrF ↑
* **Structure:** **TEDS ↑**, **MARS ↑** (Markdown/HTML structure fidelity)
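
As a rough illustration of the two error-rate metrics, WER and CER can both be computed as a Levenshtein edit distance divided by the reference length, over words and characters respectively. The sketch below is a minimal pure-Python version under assumptions of its own: whitespace tokenization and no text normalization (diacritics, punctuation, and Markdown markup are compared as-is). This section does not say which tokenization or normalization the reported scores use, so the numbers from this sketch need not match the leaderboard. Because insertions are counted against the reference length, both rates can exceed 1.0.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One of two reference words is wrong, so WER is 0.5.
print(wer("كتاب جديد", "كتاب قديم"))
# Two inserted words against a one-word reference give a WER of 2.0.
print(wer("كتاب", "كتاب جديد جدا"))
```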
---
## 🏆 Leaderboard (Misraj-DocOCR)
Best values are **bold**, second-best are <u>underlined</u>.
| Model | WER ↓ | CER ↓ | BLEU ↑ | ChrF ↑ | TEDS ↑ | MARS ↑ |
| ----------------------------- | ---------: | ---------: | ----------: | ----------: | -------: | -----------: |
| **Baseer (ours)** | **0.25** | 0.53 | <u>76.18</u> | <u>87.77</u> | **66** | **76.885** |
| Gemini-2.5-pro | <u>0.37</u> | <u>0.31</u> | **77.92** | **89.55** | <u>52</u> | <u>70.775</u> |
| Azure AI Document Intelligence[^azure] | 0.44 | **0.27** | 62.04 | 82.49 | 42 | 62.245 |
| Dots.ocr | 0.50 | 0.40 | 58.16 | 78.41 | 40 | 59.205 |
| Nanonets | 0.71 | 0.55 | 42.22 | 67.89 | 37 | 52.445 |
| Qari | 0.76 | 0.64 | 38.59 | 64.50 | 21 | 42.750 |
| Qwen2.5-VL-32B | 0.76 | 0.59 | 37.62 | 62.64 | 41 | 51.820 |
| GPT-5 | 0.86 | 0.62 | 40.67 | 61.6 | 48 | 54.8 |
| Qwen2.5-VL-3B-Instruct | 0.87 | 0.71 | 25.39 | 53.42 | 27 | 40.210 |
| Qwen2.5-VL-7B | 0.92 | 0.77 | 31.57 | 54.70 | 27 | 40.850 |
| Gemma3-12B | 0.96 | 0.80 | 19.75 | 44.53 | 33 | 38.765 |
| Gemma3-4B | 1.01 | 0.85 | 9.57 | 31.39 | 28 | 29.695 |
| GPT-4o-mini | 1.36 | 1.10 | 22.63 | 47.04 | 26 | 36.52 |
| AIN | 1.23 | 1.11 | 1.25 | 2.24 | 21 | 11.620 |
| Aya-vision | 1.41 | 1.07 | 2.91 | 9.81 | 26 | 17.905 |
**Highlights:**
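* **Baseer (ours)** has the lowest **WER** (0.25) and the highest **TEDS** (66) and **MARS** (76.885), indicating strong text and structure fidelity; it ranks fourth on **CER** (0.53).
* **Gemini-2.5-pro** tops **BLEU** (77.92) and **ChrF** (89.55); **Azure AI Document Intelligence** attains the lowest **CER** (0.27).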
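
This section does not define MARS, but every MARS value in the leaderboard equals the arithmetic mean of that row's ChrF and TEDS scores, for example (87.77 + 66) / 2 = 76.885 for Baseer. That is an observation drawn from the reported numbers, not the authors' stated definition. The sketch below checks the relationship for all rows; the helper name `mean_of_chrf_and_teds` is made up for illustration.

```python
# (ChrF, TEDS, MARS) copied from the leaderboard table.
ROWS = {
    "Baseer": (87.77, 66, 76.885),
    "Gemini-2.5-pro": (89.55, 52, 70.775),
    "Azure AI Document Intelligence": (82.49, 42, 62.245),
    "Dots.ocr": (78.41, 40, 59.205),
    "Nanonets": (67.89, 37, 52.445),
    "Qari": (64.50, 21, 42.750),
    "Qwen2.5-VL-32B": (62.64, 41, 51.820),
    "GPT-5": (61.6, 48, 54.8),
    "Qwen2.5-VL-3B-Instruct": (53.42, 27, 40.210),
    "Qwen2.5-VL-7B": (54.70, 27, 40.850),
    "Gemma3-12B": (44.53, 33, 38.765),
    "Gemma3-4B": (31.39, 28, 29.695),
    "GPT-4o-mini": (47.04, 26, 36.52),
    "AIN": (2.24, 21, 11.620),
    "Aya-vision": (9.81, 26, 17.905),
}


def mean_of_chrf_and_teds(chrf, teds):
    """Arithmetic mean of the two scores (assumed relationship, see above)."""
    return (chrf + teds) / 2


# Collect any row whose reported MARS differs from the mean of ChrF and TEDS.
mismatches = [
    name
    for name, (chrf, teds, mars) in ROWS.items()
    if abs(mean_of_chrf_and_teds(chrf, teds) - mars) > 1e-6
]
print(mismatches)  # an empty list means every row is consistent
```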
---
## 📚 How to cite
If you use **Misraj-DocOCR**, please cite:
```bibtex
@misc{hennara2025baseervisionlanguagemodelarabic,
title={Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR},
author={Khalil Hennara and Muhammad Hreden and Mohamed Motasim Hamed and Ahmad Bastati and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan},
year={2025},
eprint={2509.18174},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.18174},
}
```
Provider:
maas
Created:
2025-09-25



