Misraj-DocOCR

ModelScope community · updated 2025-12-05 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/Misraj/Misraj-DocOCR
# Misraj-DocOCR: An Arabic Document OCR Benchmark

- 📄 **Dataset:** [Misraj/Misraj-DocOCR](https://huggingface.co/datasets/Misraj/Misraj-DocOCR)
- **Domain:** Arabic Document OCR (text + structure)
- **Size:** 400 expertly verified pages (real + synthetic)
- **Use cases:** OCR, Document Understanding, Markdown/HTML structure preservation
- **Status:** Public 🤝

## ✨ Overview

**Misraj-DocOCR** is a curated, expert-verified benchmark for **Arabic document OCR** with an emphasis on **structure preservation** (Markdown/HTML tables, lists, footnotes, math, watermarks, multi-column, marginalia, etc.). Each page includes high-quality ground truth designed to evaluate both **text fidelity** and **layout/structure fidelity**.

- **Diverse content:** books, reports, forms, scholarly pages, and complex layouts.
- **Expert-verified ground truth:** human-reviewed for text **and** structure.
- **Open & reproducible:** intended for fair comparisons and reliable benchmarking.

---

## 📦 Data format

Each example typically includes:

- `uuid`: id of sample
- `image`: page image (PIL-compatible)
- `markdown`: target transcription with structure

### 🔌 Loading

```python
from datasets import load_dataset

ds = load_dataset("Misraj/Misraj-DocOCR")
split = ds["train"]  # or another available split
ex = split[0]
img = ex["image"]  # PIL.Image
gt = ex.get("markdown") or ex.get("text")
print(gt[:400])
# img.show()  # uncomment in a local environment
```

---

## 🧪 Metrics

We report both **text** and **structure** metrics:

* **Text:** WER ↓, CER ↓, BLEU ↑, ChrF ↑
* **Structure:** **TEDS ↑**, **MARS ↑** (Markdown/HTML structure fidelity)

---

## 🏆 Leaderboard (Misraj-DocOCR)

Best values are **bold**, second-best are <u>underlined</u>.

| Model | WER ↓ | CER ↓ | BLEU ↑ | ChrF ↑ | TEDS ↑ | MARS ↑ |
| ----------------------------- | ---------: | ---------: | ----------: | ----------: | -------: | -----------: |
| **Baseer (ours)** | **0.25** | 0.53 | <u>76.18</u> | <u>87.77</u> | **66** | **76.885** |
| Gemini-2.5-pro | <u>0.37</u> | <u>0.31</u> | **77.92** | **89.55** | <u>52</u> | <u>70.775</u> |
| Azure AI Document Intelligence[^azure] | 0.44 | **0.27** | 62.04 | 82.49 | 42 | 62.245 |
| Dots.ocr | 0.50 | 0.40 | 58.16 | 78.41 | 40 | 59.205 |
| Nanonets | 0.71 | 0.55 | 42.22 | 67.89 | 37 | 52.445 |
| Qari | 0.76 | 0.64 | 38.59 | 64.50 | 21 | 42.750 |
| Qwen2.5-VL-32B | 0.76 | 0.59 | 37.62 | 62.64 | 41 | 51.820 |
| GPT-5 | 0.86 | 0.62 | 40.67 | 61.6 | 48 | 54.8 |
| Qwen2.5-VL-3B-Instruct | 0.87 | 0.71 | 25.39 | 53.42 | 27 | 40.210 |
| Qwen2.5-VL-7B | 0.92 | 0.77 | 31.57 | 54.70 | 27 | 40.850 |
| Gemma3-12B | 0.96 | 0.80 | 19.75 | 44.53 | 33 | 38.765 |
| Gemma3-4B | 1.01 | 0.85 | 9.57 | 31.39 | 28 | 29.695 |
| GPT-4o-mini | 1.36 | 1.10 | 22.63 | 47.04 | 26 | 36.52 |
| AIN | 1.23 | 1.11 | 1.25 | 2.24 | 21 | 11.620 |
| Aya-vision | 1.41 | 1.07 | 2.91 | 9.81 | 26 | 17.905 |

**Highlights:**

* **Baseer (ours)** leads on **WER**, **TEDS**, and **MARS** → strong text & structure fidelity.
* **Gemini-2.5-pro** tops **BLEU/ChrF**; **Azure AI Document Intelligence** attains lowest **CER**.

---

## 📚 How to cite

If you use **Misraj-DocOCR**, please cite:

```bibtex
@misc{hennara2025baseervisionlanguagemodelarabic,
  title={Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR},
  author={Khalil Hennara and Muhammad Hreden and Mohamed Motasim Hamed and Ahmad Bastati and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan},
  year={2025},
  eprint={2509.18174},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.18174},
}
```
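
---

The text metrics in the leaderboard (WER and CER) are conventionally defined as an edit distance divided by the reference length. The benchmark's own scoring scripts and any text normalization it applies are not described here, so the following is only a minimal sketch under stated assumptions: whitespace tokenization for words, no normalization, and hypothetical helper names (`edit_distance`, `wer`, `cer`) that are not part of the dataset or any library.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate, assuming whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over raw characters, with no normalization."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings, not taken from the dataset.
reference = "مرحبا بكم في العالم"
hypothesis = "مرحبا بك في العالم الجديد"
print(wer(reference, hypothesis))  # one substitution + one insertion over 4 words: 0.5
print(cer(reference, hypothesis))
```

Because inserted tokens count as errors while the denominator is the reference length, both rates can exceed 1.0 when a model emits much more text than the page contains, which is consistent with the leaderboard rows above 1.0.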

Provider:
maas
Created:
2025-09-25