MIRACL

Name: MIRACL
Creator: WSDM CUP 2023
Published: 2025-09-30T13:42:15+08:00

Indexed from arXiv: 2025-09-30

Multilingual information retrieval

Natural language processing

Data link:

https://project-miracl.github.io/

Resource overview:

MIRACL is a resource for researchers working on multilingual information retrieval, that is, search across many languages. It covers 18 languages and is divided into four splits: a training set, a development set, test set A, and test set B. The data consists of queries, passages, and human-annotated relevance judgments. The dataset also features a "surprise languages" track, which introduces languages not seen in the training set. The training set contains more than 600,000 sample pairs, and the task is multilingual information retrieval.
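To make the relationship between queries, passages, and relevance judgments concrete, here is a minimal Python sketch. The class and field names (`Judgment`, `query_id`, `passage_id`, `relevant`) are hypothetical and are not MIRACL's actual schema. Recall@k is used only as an illustrative retrieval metric, not necessarily the one used for official scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One human relevance judgment (hypothetical layout, not MIRACL's schema)."""
    query_id: str
    passage_id: str
    relevant: bool


def build_qrels(judgments):
    """Map each query id to the set of passage ids judged relevant."""
    qrels = {}
    for j in judgments:
        relevant_ids = qrels.setdefault(j.query_id, set())
        if j.relevant:
            relevant_ids.add(j.passage_id)
    return qrels


def recall_at_k(ranked_passage_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_passage_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


# Toy data: one query with two relevant passages and one non-relevant passage.
judgments = [
    Judgment("q1", "p1", True),
    Judgment("q1", "p2", False),
    Judgment("q1", "p3", True),
]
qrels = build_qrels(judgments)

# A made-up system ranking for query q1; only p3 is a relevant hit in the top 2.
score = recall_at_k(["p3", "p2", "p9"], qrels["q1"], k=2)
print(score)
```

In this toy example the system retrieves one of the two relevant passages within the top two results, so recall@2 is 0.5.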

Provided by:

WSDM CUP 2023
