Psychias/ocr-mldr

Name: Psychias/ocr-mldr
Creator: Psychias
Published: 2026-04-27 19:21:44
License: No description available

Source: Hugging Face, updated 2026-04-27, indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/Psychias/ocr-mldr

Description:

OCR-MLDR is an OCR-degraded version of the Multi Long Document Retrieval (MLDR) benchmark, designed to evaluate embedding models on long documents containing noisy, OCR-like text. It covers six languages (German, English, Spanish, French, Russian, and Arabic), with a 2,000-document subsample per language drawn from the MLDR test split. Each passage and query was rendered as a PDF at a specific DPI and font size, then re-extracted via OCR to introduce realistic character-level noise. The dataset provides both the original clean text and the OCR-noised text, along with the original MLDR relevance judgments (qrels). Separate configurations for different DPI and font-size settings simulate varying OCR quality.
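Because each record pairs clean text with its OCR-noised counterpart, one way to gauge how much character-level noise a given configuration introduces is the character error rate (CER): the edit distance between the two strings divided by the length of the clean string. The sketch below is illustrative only. The function names and the sample strings are assumptions made for this example; they are not taken from the dataset, and the dataset's actual field names are not stated here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(clean: str, noisy: str) -> float:
    """Edit distance normalized by the length of the clean reference."""
    if not clean:
        return 0.0 if not noisy else 1.0
    return edit_distance(clean, noisy) / len(clean)


# Hypothetical strings imitating typical OCR confusions ("m" -> "rn", "l" -> "1").
clean = "modern retrieval"
noisy = "rnodern retrieva1"
cer = char_error_rate(clean, noisy)
print(f"CER = {cer:.4f}")
```

Averaging this value over a sample of documents for each DPI and font-size configuration would give a rough ordering of the configurations by noise level, which can then be compared against the drop in retrieval quality.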

Provided by:

Psychias
