arabic-img2md
Source: arXiv. Indexed 2025-09-30.
Download link:
https://huggingface.co/datasets/mohamedrashad/arabic-img2md
Overview:
arabic-img2md is a synthetic dataset of 13,700 pairs of Arabic book pages and their Markdown representations. It is primarily used to evaluate the Arabic-Nougat model, with Markdown structure accuracy and character error rate (CER) as the evaluation metrics. The tasks it covers are Arabic OCR and Markdown content extraction.
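The listing names character error rate as a metric but does not define it. As a minimal sketch, assuming the common definition (Levenshtein edit distance between predicted and reference text, divided by the reference length), CER could be computed as follows. The function names `levenshtein` and `character_error_rate` are illustrative and are not taken from the dataset or from Arabic-Nougat; the exact normalization used in the original evaluation is not stated here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # prev[j] is the distance between the ref prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One character dropped from a four-character reference.
    print(character_error_rate("كتاب", "كتب"))
```

Because Python strings are sequences of Unicode code points, Arabic diacritics count as separate characters in this sketch; whether to strip or normalize them before scoring is a choice the listing leaves unspecified.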



