NAMAA-Space/QariOCR-v0.3-markdown-mixed-dataset

Name: NAMAA-Space/QariOCR-v0.3-markdown-mixed-dataset
Creator: NAMAA-Space
Published: 2025-06-10 12:34:50
License: No description available

Source: Hugging Face
Updated: 2025-06-10
Indexed: 2025-07-05

Download link:

https://hf-mirror.com/datasets/NAMAA-Space/QariOCR-v0.3-markdown-mixed-dataset

Resource overview:

The QARI Markdown Mixed Dataset is a synthetic dataset of 37,000 Arabic document images, each paired with ground-truth text in HTML/Markdown format. It supports full diacritics and uses 12 different Arabic fonts with varied document layouts. The dataset is intended for training OCR models that understand document structure, fine-tuning vision-language models for Arabic text recognition, and developing systems that preserve formatting and layout information. It is divided into training, validation, and test sets, which account for 80%, 10%, and 10% of the total, respectively.
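
As a rough check on the stated split, the sketch below works out how many images each subset would hold if the 80%/10%/10% proportions apply exactly to the 37,000 total. The helper `split_sizes` and the split names `train`, `validation`, and `test` are illustrative assumptions, not taken from the dataset itself; the published subsets may differ slightly.

```python
def split_sizes(total: int, percents: dict[str, int]) -> dict[str, int]:
    """Return per-split counts for integer percentages that sum to 100.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    sizes = {name: total * pct // 100 for name, pct in percents.items()}
    # Assign any rounding remainder to the first split so counts sum to total.
    first = next(iter(sizes))
    sizes[first] += total - sum(sizes.values())
    return sizes


# Assumed split names; the description only gives the proportions.
sizes = split_sizes(37_000, {"train": 80, "validation": 10, "test": 10})
print(sizes)  # {'train': 29600, 'validation': 3700, 'test': 3700}
```

Under these assumptions the training set would hold 29,600 images and the validation and test sets 3,700 each.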

Provided by:

NAMAA-Space

The content above was collected and summarized by 遇见数据集.
