opendatalab/OHR-Bench
Source: Hugging Face | Updated: 2025-08-28 | Indexed: 2025-04-08
Download link:
https://hf-mirror.com/datasets/opendatalab/OHR-Bench
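The link above can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` Python package is installed and that the mirror host serves the standard Hub API. The local directory name `OHR-Bench` and the helper names `download_kwargs` and `download` are hypothetical and chosen only for illustration.

```python
# Sketch: download the dataset repository through the mirror host.
# Assumptions (not stated on this page): huggingface_hub is installed,
# and https://hf-mirror.com accepts the same API calls as the Hub.

REPO_ID = "opendatalab/OHR-Bench"      # repository name from this listing
MIRROR = "https://hf-mirror.com"       # host taken from the download link


def download_kwargs(local_dir="OHR-Bench"):
    """Build the keyword arguments for snapshot_download (hypothetical helper)."""
    return {
        "repo_id": REPO_ID,
        "repo_type": "dataset",   # the link points at /datasets/...
        "endpoint": MIRROR,       # route requests through the mirror
        "local_dir": local_dir,   # hypothetical target directory
    }


def download(local_dir="OHR-Bench"):
    """Fetch all files of the dataset repo and return the local path."""
    # Imported lazily so the helper above works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir))
```

Calling `download()` would place the repository files under `OHR-Bench/` in the current directory. The layout of the downloaded files is not described on this page, so inspect the directory before relying on any particular path.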
Overview:
The OHR-Bench dataset is designed to evaluate the cascading impact of OCR on Retrieval-Augmented Generation (RAG). It contains over 8,500 unstructured PDF pages from a range of domains and 8,498 Q&A pairs. Each PDF page comes with human-verified, structured ground-truth data. The dataset also provides semantic noise and formatting noise at several levels of severity to simulate real-world OCR errors.
Provider:
opendatalab



