presightai/arabic_doc_to_markdown
Hugging Face | Updated 2025-07-04 | Indexed 2025-07-05
Download link:
https://hf-mirror.com/datasets/presightai/arabic_doc_to_markdown
Description:
This dataset contains OCR image-markdown pairs curated for document structure retrieval and reconstruction tasks, covering Arabic and English documents. The data comes from official Arabic government document portals, Arabic news websites and online magazines, and community and forum archives with structured, mixed Arabic-English content. Each PDF is split into individual pages, and each page is converted to a high-quality PNG image. Large language models are then applied directly to the page images to generate markdown-formatted output that captures both textual and structural content.
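The page-level pipeline described above (split each PDF into pages, render each page to PNG, generate markdown from the image) can be sketched roughly as follows. This is only an illustration, not the dataset authors' code: the `PagePair` record, the `build_pairs` function, the file-naming scheme, and the `render_page` and `transcribe` callables are all hypothetical stand-ins for whatever PDF renderer and language model were actually used.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class PagePair:
    """One image-markdown pair for a single document page (hypothetical schema)."""
    source_pdf: str
    page_number: int  # 1-based page index within the source PDF
    image_path: str
    markdown: str


def build_pairs(
    pdf_path: str,
    page_count: int,
    render_page: Callable[[str, int, str], None],
    transcribe: Callable[[str], str],
    out_dir: str = "pages",
) -> List[PagePair]:
    """Build one PagePair per page of a PDF.

    render_page(pdf_path, page_number, image_path) is assumed to write a PNG.
    transcribe(image_path) is assumed to return markdown for that image,
    for example by calling a vision-capable language model.
    """
    stem = Path(pdf_path).stem
    pairs: List[PagePair] = []
    for page_number in range(1, page_count + 1):
        # Assumed naming scheme: <pdf stem>_p<zero-padded page number>.png
        image_path = (Path(out_dir) / f"{stem}_p{page_number:04d}.png").as_posix()
        render_page(pdf_path, page_number, image_path)
        markdown = transcribe(image_path)
        pairs.append(PagePair(pdf_path, page_number, image_path, markdown))
    return pairs


if __name__ == "__main__":
    # Stub callables so the sketch runs without a PDF library or a model.
    demo = build_pairs(
        "docs/report.pdf",
        page_count=2,
        render_page=lambda pdf, n, out: None,
        transcribe=lambda img: f"# Placeholder markdown for {img}",
    )
    for pair in demo:
        print(pair.page_number, pair.image_path)
```

Passing the renderer and the model in as callables keeps the sketch independent of any particular PDF library or model API, neither of which the description names.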
Provider:
presightai



