antoineedy/vidore_v3_test_hr_mteb_format

Name: antoineedy/vidore_v3_test_hr_mteb_format
Creator: antoineedy
Published: 2025-12-19 09:12:21
License: not specified

Source: Hugging Face. Updated: 2025-12-19. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/antoineedy/vidore_v3_test_hr_mteb_format

Description:

Vidore3HrRetrieval is a multilingual document retrieval dataset that is part of MTEB (Massive Text Embedding Benchmark). It contains high-resolution reports released by the European Union and is designed for complex-document understanding tasks. The original queries were written in English and then translated into French, German, Italian, Portuguese, and Spanish. The dataset supports several task types, including visual document retrieval, image-to-text, and text-to-image. It has three main components: corpus, qrels (the documents relevant to each query), and queries, with a separate configuration for each language. According to the reported statistics, the dataset contains 8,568 test samples totaling 237,551 characters, with an average query length of 124.5 characters and an average of 5.44 relevant documents per query.
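To make the corpus / queries / qrels split concrete, here is a minimal sketch of how the three components relate and how the two per-query averages quoted above could be computed. It uses toy in-memory data only. The field names ("query-id", "corpus-id", "score"), the IDs, and the texts are assumptions made for illustration and are not taken from the dataset itself.

```python
from collections import defaultdict

# Toy stand-ins for the three components. All IDs and texts are invented.
corpus = {
    "page-1": "Report page about hiring figures.",
    "page-2": "Report page about staff turnover.",
    "page-3": "Report page about unit growth.",
}
queries = {
    "q-1": "How many staff were hired?",
    "q-2": "Which unit grew?",
}
# Each qrels row links one query to one corpus entry with a relevance score.
# The key names are assumed, not confirmed by the dataset card.
qrels = [
    {"query-id": "q-1", "corpus-id": "page-1", "score": 1},
    {"query-id": "q-1", "corpus-id": "page-2", "score": 1},
    {"query-id": "q-2", "corpus-id": "page-3", "score": 1},
    {"query-id": "q-2", "corpus-id": "page-1", "score": 0},  # not relevant
]


def relevant_docs_per_query(qrels_rows):
    """Map each query ID to the set of corpus IDs with a positive score."""
    relevant = defaultdict(set)
    for row in qrels_rows:
        if row["score"] > 0:
            relevant[row["query-id"]].add(row["corpus-id"])
    return dict(relevant)


def average_relevant_docs(qrels_rows, query_table):
    """Mean number of relevant documents per query."""
    relevant = relevant_docs_per_query(qrels_rows)
    total = sum(len(relevant.get(qid, ())) for qid in query_table)
    return total / len(query_table)


def average_query_length(query_table):
    """Mean query length in characters."""
    return sum(len(text) for text in query_table.values()) / len(query_table)


if __name__ == "__main__":
    print(relevant_docs_per_query(qrels))
    print(average_relevant_docs(qrels, queries))
    print(average_query_length(queries))
```

On the real data, the same two averages would be computed over one language configuration at a time, since each language has its own configuration.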

Provided by:

antoineedy
