wikIR78k, wikIRS78k

Name: wikIR78k, wikIRS78k
Creator: Grenoble Alpes University
Published: 2020-03-17 17:25:34
License: No description available

Source: arXiv, updated 2020-03-17, indexed 2024-06-21

Download links:

https://www.zenodo.org/record/3707606, https://www.zenodo.org/record/3707238

Description:

wikIR78k and wikIRS78k are two large-scale English information retrieval datasets built from Wikipedia by a research team at Grenoble Alpes University. Each contains 78,628 queries and more than 3 million (query, relevant document) pairs. They are intended to address the poor performance of deep learning models in information retrieval that stems from a lack of training data. The datasets were created by extracting information from Wikipedia articles, constructing queries and documents, and determining the relevance of documents to queries with specific algorithms. They are particularly suited to training and evaluating deep text matching models, both on short, precise queries (wikIR78k) and on long, noisy queries (wikIRS78k).
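As a rough illustration of the construction process described above, the following Python sketch builds a short and a long query from a toy article and derives (query, relevant document) pairs. The rules used here are assumptions made for illustration only: the title serves as the short query, the first sentence as the long query, and an article counts as relevant if it is the source article or links to it. The description only states that relevance is determined by "specific algorithms". The names `Article`, `build_queries`, and `build_qrels` are hypothetical and not part of the datasets or their tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """A toy stand-in for a Wikipedia article (hypothetical structure)."""
    title: str
    text: str
    links: set = field(default_factory=set)  # titles this article links to


def build_queries(article):
    """Return (short_query, long_query) for one article.

    Assumption: the short query is the title and the long query is the
    first sentence of the article text.
    """
    first_sentence = article.text.split(". ")[0].rstrip(".")
    return article.title.lower(), first_sentence.lower()


def build_qrels(articles):
    """Map each source title to the set of titles assumed relevant.

    Assumption: an article is relevant to the query built from itself,
    and any article that links to the source article is also relevant.
    """
    qrels = {}
    for source in articles:
        relevant = {source.title}
        for other in articles:
            if source.title in other.links:
                relevant.add(other.title)
        qrels[source.title] = relevant
    return qrels


articles = [
    Article(
        "Information retrieval",
        "Information retrieval is the task of finding relevant documents. "
        "It is widely studied.",
    ),
    Article(
        "Search engine",
        "A search engine is an information retrieval system. It ranks documents.",
        links={"Information retrieval"},
    ),
    Article("Grenoble", "Grenoble is a city in France. It lies in the Alps."),
]

short_q, long_q = build_queries(articles[0])
qrels = build_qrels(articles)
```

In this toy setup the short query is a few precise words while the long query is a full sentence containing extra, noisier terms, which mirrors the contrast drawn between wikIR78k and wikIRS78k.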

Provided by:

Grenoble Alpes University

Created:

2019-12-04
