spanish-ir/eswiki_20240401_corpus

Name: spanish-ir/eswiki_20240401_corpus
Creator: spanish-ir
Published: 2024-09-11 13:39:34
License: no description available

Source: Hugging Face
Updated: 2024-09-11
Indexed: 2024-06-12

Download link:

https://hf-mirror.com/datasets/spanish-ir/eswiki_20240401_corpus

Description:

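MessIRve (corpus) is a large-scale information retrieval dataset built from the Spanish Wikipedia dump of April 1, 2024. Each document corresponds to one Wikipedia paragraph, and the article title is provided as additional context for retrieval. The fields are docid (document ID), title (article title), and text (paragraph text). The docid follows the pattern X#Y, where X is the article ID and Y is the paragraph's position within the article.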

MessIRve is a Spanish Wikipedia corpus where each document corresponds to a paragraph in Wikipedia, with the article title provided as additional context. The dataset includes three fields: docid (document ID), title (article title), and text (document text). The dataset has a download size of 3.1GB and a total size of 5.47GB, containing 14,047,759 samples. It was created using the WikiExtractor tool to process the Spanish Wikipedia dump from April 1, 2024.
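Since the docid combines the article ID and the paragraph position in the X#Y pattern, it can be useful to split it when grouping paragraphs by article. The following is a minimal sketch, not part of the dataset's own tooling: the names `DocId` and `parse_docid` are hypothetical, the example value `12345#3` is made up, and the code assumes that Y is a non-negative integer (the listing does not say whether numbering starts at 0 or 1).

```python
from typing import NamedTuple


class DocId(NamedTuple):
    article_id: str  # X: the Wikipedia article ID, kept as a string
    paragraph: int   # Y: the paragraph's position within the article (assumed integer)


def parse_docid(docid: str) -> DocId:
    """Split a docid of the form X#Y into its article ID and paragraph position."""
    # Split on the last '#' so that only the trailing part is read as Y.
    article_id, sep, paragraph = docid.rpartition("#")
    if not sep or not article_id or not paragraph.isdecimal():
        raise ValueError(f"docid does not match the X#Y pattern: {docid!r}")
    return DocId(article_id, int(paragraph))


# Hypothetical docid, for illustration only.
example = parse_docid("12345#3")
print(example.article_id, example.paragraph)
```

Keeping X as a string avoids assuming anything about the article ID beyond what the description states; only Y is converted, because it is described as an ordering.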

Provider:

spanish-ir
