Wikipedia Vectors

Figshare | Updated 2016-04-23 | Indexed 2026-04-08

Download link:

https://figshare.com/articles/dataset/Wikipedia_Vectors/3146878/1

Description:

In this project, we learned embeddings for Wikipedia articles and Wikidata items by applying Word2vec models to a corpus of reading sessions.

Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured by cosine similarity, for example). Consequently, applying Word2vec to reading sessions results in article embeddings in which articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles.

There have been several approaches to learning vector representations of Wikipedia articles that capture semantic similarity by using the article text or the links between articles. An advantage of training Word2vec models on reading sessions is that they learn from the actions of millions of humans who are using a diverse array of signals, including the article text, links, third-party search engines, and their existing domain knowledge, to determine what to read next in order to learn about a topic.

An additional benefit of not relying on text or links is that we can learn representations for Wikidata items by simply mapping article titles within each session to Wikidata items using Wikidata sitelinks. As a result, these Wikidata vectors are jointly trained over reading sessions for all Wikipedia language editions, allowing the model to learn from people across the globe. This approach also overcomes data sparsity issues for smaller Wikipedias, since the representations for articles in smaller Wikipedias are shared across many other potentially larger ones.
Finally, instead of needing to generate a separate embedding for each Wikipedia in each language, we have a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item.

For detailed documentation, see the wiki page.
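
The mapping step described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the `SITELINKS` table, the session format, and the function names are assumptions made for illustration. It shows how reading sessions from different language editions might be rewritten as sequences of Wikidata item IDs, and how cosine similarity between two vectors can be computed.

```python
import math

# Hypothetical sitelink table: (wiki, article title) -> Wikidata item ID.
# Real sitelinks would come from Wikidata; these rows are illustrative only.
SITELINKS = {
    ("enwiki", "Berlin"): "Q64",
    ("dewiki", "Berlin"): "Q64",
    ("enwiki", "Germany"): "Q183",
    ("frwiki", "Allemagne"): "Q183",
}


def sessions_to_items(sessions, sitelinks):
    """Rewrite each session as a list of Wikidata IDs, dropping unmapped pages."""
    corpus = []
    for session in sessions:
        items = [sitelinks[page] for page in session if page in sitelinks]
        # A session with fewer than two items has no neighbors to learn from.
        if len(items) > 1:
            corpus.append(items)
    return corpus


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Assumed session format: an ordered list of (wiki, title) page views.
sessions = [
    [("enwiki", "Berlin"), ("enwiki", "Germany")],
    [("dewiki", "Berlin"), ("frwiki", "Allemagne"), ("frwiki", "Page inconnue")],
]
corpus = sessions_to_items(sessions, SITELINKS)
```

In this sketch, sessions from English, German, and French readers collapse into the same item sequence, which is how smaller editions can share representations with larger ones. The resulting `corpus` could then be passed to any Word2vec implementation, treating each session as a sentence and each item ID as a word.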

Created:

2016-04-23
