MessIRve

Name: MessIRve
Creator: CONICET-UBA. Instituto de Ciencias de la Computación. Buenos Aires, Argentina
Published: 2024-09-10 02:45:04
License: not specified here (the dataset card below lists CC BY-NC 4.0)

Source: arXiv. Updated: 2024-09-10. Indexed: 2024-09-12.

Download link:

https://huggingface.co/datasets/spanish-ir/google_qrels

Overview:

MessIRve is a large-scale Spanish-language information retrieval dataset created by the CONICET-UBA Institute of Computer Science in Buenos Aires, Argentina. This dataset contains approximately 730,000 queries obtained from Google's Autocomplete API, alongside relevant documents extracted from Wikipedia. The dataset creation process includes generating queries via Google's Autocomplete API and acquiring relevant documents through Google Search's "Featured Snippets". MessIRve aims to address the shortage of high-quality evaluation datasets in the field of Spanish-language information retrieval, advance research in Spanish information retrieval, and improve information access tools for Spanish-speaking users.

Provided by:

CONICET-UBA. Instituto de Ciencias de la Computación. Buenos Aires, Argentina

Created:

2024-09-10

Summary of Original Information

Dataset Card for MessIRve

Dataset Details

Dataset Description

Language(s) (NLP): Spanish
License: CC BY-NC 4.0

Dataset Sources

Repository: TBA
Paper: MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Dataset Structure

Data Instances

A typical instance of one subset of the dataset looks like:

{
  "id": 4918739,
  "query": "a cual dedo se pone el anillo de compromiso",
  "docid": "956254#2",
  "docid_text": "Pero desde hace cientos de años, se dice que la vena amoris pasa por el dedo anular izquierdo que conecta directamente al corazón (téngase en cuenta que la vena amoris no existe realmente). Tradicionalmente, es ofrecido por el hombre como regalo a su novia mientras o cuando ella accede a la proposición de matrimonio. Representa una aceptación formal del futuro compromiso.",
  "query_date": "2024-03-30",
  "answer_date": "2024-04-19",
  "match_score": 0.74,
  "expanded_search": false,
  "answer_type": "feat_snip"
}

Data Fields

id: query id
query: query text
docid: relevant document id in the corpus
docid_text: relevant document text
query_date: date the query was extracted
answer_date: date the answer was extracted
match_score: length of the longest string in the SERP answer that is also a substring of the matched document text, as a ratio of the length of the SERP answer
expanded_search: whether the SERP returned a message indicating that the search was "expanded" with additional results ("se incluyen resultados de...")
answer_type: type of answer extracted; featured snippets (feat_snippet) are the most important type. Note that the example instance above shows the value "feat_snip".
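
The match_score definition can be illustrated with a short Python sketch. This is not the authors' code: the function name and the choice to compare raw characters (with no normalization of case, accents, or whitespace) are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def match_score(serp_answer: str, doc_text: str) -> float:
    """Length of the longest SERP-answer substring that also occurs in
    doc_text, divided by the length of the SERP answer.

    Assumption: character-level comparison on unnormalized text.
    """
    if not serp_answer:
        return 0.0
    # autojunk=False turns off difflib's popularity heuristic, so the
    # longest common contiguous block is found exactly.
    matcher = SequenceMatcher(None, serp_answer, doc_text, autojunk=False)
    block = matcher.find_longest_match(0, len(serp_answer), 0, len(doc_text))
    return block.size / len(serp_answer)


# Toy strings: the longest shared run is "cde" (3 of 6 characters).
print(match_score("abcdef", "xxcdexx"))
```

Under this reading, a score of 0.74 as in the example instance would mean that a single contiguous run covering 74% of the snippet text appears verbatim in the matched passage.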

Data Splits

The dataset is split into multiple configurations, each corresponding to a different country or a combination of countries. Each configuration has a train and test split.

Configurations

ar: Argentina
bo: Bolivia
cl: Chile
co: Colombia
cr: Costa Rica
cu: Cuba
do: Dominican Republic
ec: Ecuador
es: Spain
full: Full dataset combining all countries
general: General dataset
gt: Guatemala
hn: Honduras
mx: Mexico
ni: Nicaragua
no_country: Queries not specific to any country
pa: Panama
pe: Peru
pr: Puerto Rico
py: Paraguay
sv: El Salvador
us: United States
uy: Uruguay
ve: Venezuela

Split Details

train: Training set
test: Test set

Example Configurations

ar:
- train: 22,261 examples, 12.75 MB
- test: 5,780 examples, 3.35 MB
bo:
- train: 25,015 examples, 14.64 MB
- test: 4,707 examples, 2.77 MB
full:
- train: 571,120 examples, 333.36 MB
- test: 160,099 examples, 95.64 MB
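
As a quick consistency check on the figures above, the following Python sketch totals each listed configuration and computes the share of examples in the test split. The numbers are copied from the list; the helper name `summarize` is only illustrative.

```python
# Split sizes copied from the "Example Configurations" list above.
SPLIT_SIZES = {
    "ar": {"train": 22_261, "test": 5_780},
    "bo": {"train": 25_015, "test": 4_707},
    "full": {"train": 571_120, "test": 160_099},
}


def summarize(sizes: dict[str, dict[str, int]]) -> dict[str, tuple[int, float]]:
    """Return (total examples, test fraction) for each configuration."""
    summary = {}
    for name, splits in sizes.items():
        total = splits["train"] + splits["test"]
        summary[name] = (total, splits["test"] / total)
    return summary


for name, (total, test_fraction) in summarize(SPLIT_SIZES).items():
    print(f"{name}: {total:,} examples, {test_fraction:.1%} in test")
```

The full configuration sums to 731,219 examples, which agrees with the "approximately 730,000 queries" stated in the overview.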

Uses

The dataset is meant to be used to train and evaluate Spanish IR models.

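Collection Summary

Dataset Introduction

Construction

The construction of MessIRve starts by collecting Spanish queries from Google's Autocomplete API, covering 20 Spanish-speaking countries and regions plus the United States. The researchers used a set of predefined prefixes, such as "qué" (what), "cómo" (how), and "dónde" (where), to obtain popular queries beginning with those prefixes. To ensure diversity, they iteratively expanded the prefix set until a predetermined number of results was reached. They also considered queries not specific to any country. Relevant documents were obtained through Google Search's "featured snippets", which link to the relevant Wikipedia articles. To build the corpus, the researchers used Spanish Wikipedia data dated April 1, 2024, processed with the WikiExtractor tool.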
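
The iterative prefix expansion can be pictured with a small Python sketch. The source does not say how prefixes were expanded, so everything here is an assumption: the breadth-first order, the rule of appending one letter to a productive prefix, and the `fetch_suggestions` callable, which stands in for a hypothetical autocomplete client and is replaced by a toy in-memory lookup.

```python
from collections import deque
from string import ascii_lowercase
from typing import Callable, Iterable


def collect_queries(
    seed_prefixes: Iterable[str],
    fetch_suggestions: Callable[[str], list[str]],  # hypothetical client
    target: int,
    max_requests: int = 1000,
) -> set[str]:
    """Expand prefixes breadth-first until at least `target` distinct
    queries are collected, the queue empties, or the budget runs out."""
    queue = deque(seed_prefixes)
    queries: set[str] = set()
    requests = 0
    while queue and len(queries) < target and requests < max_requests:
        prefix = queue.popleft()
        requests += 1
        suggestions = fetch_suggestions(prefix)
        queries.update(suggestions)
        if suggestions:
            # Assumed expansion rule: extend a productive prefix by one letter.
            queue.extend(f"{prefix} {letter}" for letter in ascii_lowercase)
    return queries


# Toy stand-in for an autocomplete service: at most two suggestions per call.
TOY_INDEX = [
    "qué es el amor",
    "qué es la vida",
    "qué animal es el más rápido",
    "cómo hacer pan",
]


def toy_suggestions(prefix: str) -> list[str]:
    return [q for q in TOY_INDEX if q.startswith(prefix)][:2]


collected = collect_queries(["qué", "cómo"], toy_suggestions, target=4)
print(sorted(collected))
```

In the toy run, the third "qué" query only surfaces after the prefix is extended to "qué a", which is the kind of coverage gain that expanding the prefix set is meant to provide.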

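Characteristics

MessIRve is notable for its scale: it contains approximately 730,000 queries with relevant documents from Wikipedia. Unlike other datasets, its queries reflect the diversity of different Spanish-speaking regions rather than being translated from English or ignoring dialectal variation. Its size allows it to cover a wide range of topics. The data collection process is documented in detail, with a clear description of how Wikipedia articles were selected. Finally, the evaluation of the dataset indicates acceptable quality: most queries were judged correct and the documents were judged relevant.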

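Usage

MessIRve can be used to develop and evaluate Spanish information retrieval systems. The researchers provide baseline results from zero-shot evaluation on the MessIRve test set for BM25, MIRACL-mdpr-es, E5-large, and OpenAI-large. These results help researchers understand how different models perform and guide the development of new retrieval algorithms. Because the dataset is publicly available, other researchers can also use it to train and evaluate their own retrieval systems, advancing research in Spanish information retrieval.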
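
To make the evaluation setting concrete, here is a minimal Python sketch that scores a ranked run against positive-only relevance judgments. Recall@k and MRR@k are common IR metrics chosen for illustration; the text above does not say which metrics the authors report. Apart from the query id and docid taken from the example instance, all ids below are made-up toy values.

```python
def recall_at_k(qrels: dict[int, set[str]], run: dict[int, list[str]], k: int) -> float:
    """Mean fraction of each query's relevant docids found in its top-k results."""
    scores = []
    for qid, relevant in qrels.items():
        top_k = run.get(qid, [])[:k]
        scores.append(len(relevant.intersection(top_k)) / len(relevant))
    return sum(scores) / len(scores)


def mrr_at_k(qrels: dict[int, set[str]], run: dict[int, list[str]], k: int) -> float:
    """Mean reciprocal rank of the first relevant docid within the top k."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


# qrels: query id -> relevant docids; run: query id -> ranked docids.
toy_qrels = {4918739: {"956254#2"}, 1: {"10#0"}}
toy_run = {
    4918739: ["111#0", "956254#2", "5#1"],  # relevant passage at rank 2
    1: ["7#3", "8#1", "9#0"],               # relevant passage not retrieved
}

print(recall_at_k(toy_qrels, toy_run, k=3))
print(mrr_at_k(toy_qrels, toy_run, k=3))
```

Because only positive judgments are available, any retrieved passage absent from the qrels counts as non-relevant here, which can understate a system's true quality.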

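Background and Challenges

Background Overview

Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language in the world, current IR benchmarks lack Spanish data, which hinders the development of information access tools for Spanish speakers. MessIRve is a large-scale Spanish IR dataset with approximately 730,000 queries from Google's Autocomplete API and relevant documents from Wikipedia. Its queries reflect different Spanish-speaking regions, whereas other datasets are either translated from English or do not account for dialectal variation. The dataset's large size allows it to cover a wide range of topics, which smaller datasets cannot. The authors provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Their contributions aim to advance Spanish IR research and improve information access for Spanish speakers.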

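Current Challenges

The challenges facing MessIRve include: (1) queries and documents were collected automatically during construction, which may introduce biases or gaps into the dataset; (2) the dataset contains only positive relevance judgments and lacks negative ones, which limits training for models that require negative examples; (3) the topic distribution may not fully match the distribution of topics actually popular in Spanish-speaking communities, which may affect the dataset's representativeness.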
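
One common workaround for the missing negative judgments is to sample negatives from the corpus at training time. The Python sketch below shows uniform random sampling; it is not described in the source, and the function name, record layout beyond the `query` and `docid` fields, and toy docids are assumptions for illustration.

```python
import random


def build_triples(
    records: list[dict],
    corpus_docids: list[str],
    negatives_per_query: int = 2,
    seed: int = 0,
) -> list[tuple[str, str, str]]:
    """Pair each (query, positive docid) with randomly sampled negative docids."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    triples = []
    for record in records:
        # Exclude the known positive; everything else is treated as negative.
        candidates = [d for d in corpus_docids if d != record["docid"]]
        count = min(negatives_per_query, len(candidates))
        for negative in rng.sample(candidates, count):
            triples.append((record["query"], record["docid"], negative))
    return triples


toy_records = [
    {"query": "a cual dedo se pone el anillo de compromiso", "docid": "956254#2"},
]
toy_corpus = ["956254#2", "111#0", "5#1", "7#3"]  # made-up docids except the first

for triple in build_triples(toy_records, toy_corpus):
    print(triple)
```

Randomly sampled passages are unjudged rather than known to be irrelevant, so some may in fact answer the query; mining hard negatives from a first-stage retriever is another option, with the same caveat.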

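Common Scenarios

Typical Use Cases

MessIRve is designed specifically for Spanish information retrieval. Its typical use cases include, but are not limited to, training and evaluating Spanish IR models, studying the query behavior of Spanish-speaking users in different regions, and developing information access tools for Spanish-speaking users. Because it contains approximately 730,000 queries from Google's Autocomplete API together with relevant documents from Wikipedia, it covers a wide range of topics and reflects the diversity of different Spanish-speaking regions.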

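Derived and Related Work

The release of MessIRve gives researchers in Spanish information retrieval a new research tool and has promoted related research. For example, researchers can use it to train and evaluate Spanish IR models, study the query behavior of Spanish-speaking users in different regions, and develop information access tools for Spanish-speaking users. The dataset can also be used to develop Spanish question answering systems that give Spanish-speaking users more accurate and relevant answers.

Recent Research on the Dataset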