
Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections

Indexed from NIAID Data Ecosystem on 2026-03-12
Download link:
https://zenodo.org/record/4468782
Resource description:
Three corpora in different domains extracted from Wikipedia. For all datasets, the figures and tables have been filtered out, as have the categories and "see also" sections. The article structure, particularly the subtitles and paragraphs, is kept in these datasets.

Wines
The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset is a non-trivial mixture of articles covering different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are:
Dom Pérignon - Moët & Chandon
Pinot Meunier - Chardonnay

Movies
The Wikipedia movies dataset consists of 100,385 articles describing different movies. A movie article may contain text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are:
Schindler's List - The Pianist
Lion King - The Jungle Book

Video games
The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may contain a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are:
Grand Theft Auto - Mafia
Burnout Paradise - Forza Horizon 3
Created:
2021-01-27