Psychias/ocr-mldr
Source: Hugging Face | Updated: 2026-04-27 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/Psychias/ocr-mldr
Description:
OCR-MLDR is an OCR-degraded version of the Multi Long Document Retrieval (MLDR) benchmark, designed to evaluate embedding models on long documents containing noisy, OCR-like text. It covers six languages (German, English, Spanish, French, Russian, and Arabic), with a 2,000-document subsample per language drawn from the MLDR test split. Each passage and query was rendered as a PDF at specific DPI and font-size settings and then re-extracted via OCR to introduce realistic character-level noise. The dataset provides both the original clean text and the OCR-noised text, along with the original MLDR relevance judgments (qrels). Separate configurations for different DPI and font-size combinations simulate varying levels of OCR quality.
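Because the dataset pairs clean text with its OCR-noised counterpart and keeps the MLDR qrels, two quantities are natural to compute: how much character-level noise a configuration introduces, and how much retrieval quality drops as a result. The following is a minimal sketch in plain Python, not code from the dataset authors. The function names, the example strings, and the document IDs are all hypothetical, and it does not assume any particular column or configuration names in the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance from a to b; insert, delete, and substitute each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def char_error_rate(clean: str, noisy: str) -> float:
    """Edit distance normalised by the length of the clean reference text."""
    if not clean:
        return 0.0 if not noisy else 1.0
    return levenshtein(clean, noisy) / len(clean)


def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the relevant documents (from the qrels) found in the top k."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    # Hypothetical clean/noisy pair with typical OCR confusions (m -> rn, l -> 1).
    clean_text = "modern retrieval"
    noisy_text = "rnodern retrieva1"
    print(char_error_rate(clean_text, noisy_text))  # 3 edits / 16 chars = 0.1875

    # Hypothetical ranking for one query and its relevant document IDs.
    print(recall_at_k(["d3", "d1", "d7"], {"d1", "d9"}, k=2))  # 0.5
```

This quadratic-time edit distance is fine for short passages but slow on full long documents, so in practice one would likely compute it on samples or use an optimised library. Running the same retrieval metric once on the clean text and once on the OCR-noised text gives the degradation attributable to OCR noise for a given DPI and font-size configuration.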
Provider:
Psychias



