Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections

NIAID Data Ecosystem2026-03-12 收录

下载链接：

https://zenodo.org/record/4468782

下载链接

链接失效反馈

官方服务：

资源简介：

Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections. The article structure, and particularly the sub-titles and paragraphs are kept in these datasets Wines Wikipedia wines dataset consists of 1635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, which annotated 92 source articles with ~10 ground-truth recommendations for each sample. Examples for ground-truth expert-based recommendations are Dom Pérignon - Moët & Chandon Pinot Meunier - Chardonnay Movies The Wikipedia movies dataset consists of 100385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground truth annotations for 50 source articles using the "BestSimilar" database. Each source articles is associated with a list of ${\scriptsize \sim}12$ most similar movies. Examples for ground-truth expert-based recommendations are Schindler's List - The Pianist Lion King - The Jungle Book Video games The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are: Grand Theft Auto - Mafia Burnout Paradise - Forza Horizon 3

创建时间：

2021-01-27