Wikipedia Article Topics for All Languages (based on article outlinks)
Source: figshare.com | Updated: 2021-07-20 | Indexed: 2025-03-23
Download link:
https://figshare.com/articles/dataset/Wikipedia_Article_Topics_for_All_Languages_based_on_article_outlinks_/12619766/3
Description:
This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It omits articles without any valid outlinks, i.e. links to other Wikipedia articles. The current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier or future versions may be based on other snapshots, as indicated by the filename.

The data is bzip-compressed. Each row is tab-delimited and contains the following metadata, followed by the predicted probability (rounded to three decimal places to reduce file size) that each topic in this taxonomy applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy

* wiki_db: which Wikipedia language edition the article belongs to, e.g., enwiki == English Wikipedia
* qid: the ID of the article's Wikidata item, if it has one, e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
* pid: the page ID of the article, e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
* num_outlinks: the number of Wikipedia links in the article that the model used to make its prediction. This is counted after removing links to non-article namespaces (e.g., categories, templates), links to articles without Wikidata IDs (very few), and interwiki links, i.e. only links to namespace 0 articles in the same wiki that have associated Wikidata IDs are retained. It is mainly provided to give a sense of how much data the prediction is based upon.

For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance

Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of the article would be included. The sample includes 201,196 Wikidata IDs, which correspond to 340,290 articles.
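
The row layout described above can be read with a short script. The following is a minimal sketch, not the dataset authors' own tooling. It assumes that the file is bzip2-compressed (readable with Python's `bz2` module), that the first line is a header naming every column, and that the four metadata columns are labeled exactly `wiki_db`, `qid`, `pid`, and `num_outlinks`. The function names `parse_rows`, `read_topics`, and `topics_above` are hypothetical helpers introduced for illustration.

```python
import bz2
import csv
from typing import Dict, Iterable, Iterator, List, Tuple

# Metadata fields named in the dataset description. Treating them as the
# literal header labels is an assumption; check the first line of the file.
META_COLS = ["wiki_db", "qid", "pid", "num_outlinks"]

Record = Tuple[Dict[str, str], Dict[str, float]]


def parse_rows(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (metadata, topic_probabilities) pairs from tab-delimited lines.

    Assumes the first line is a header row (not confirmed by the description).
    Every column that is not a metadata field is treated as a topic column.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    topic_cols = [name for name in header if name not in META_COLS]
    for row in reader:
        record = dict(zip(header, row))
        meta = {name: record[name] for name in META_COLS}
        topics = {name: float(record[name]) for name in topic_cols}
        yield meta, topics


def read_topics(path: str) -> Iterator[Record]:
    """Stream records from a bzip2-compressed file without loading it whole."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from parse_rows(handle)


def topics_above(topics: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the topic labels whose probability meets the threshold.

    The 0.5 default is an illustrative choice, not a recommended cutoff.
    """
    return sorted(name for name, prob in topics.items() if prob >= threshold)
```

To use it, iterate over `read_topics(path)` with the path of a downloaded file and pass each `topics` dictionary to `topics_above`. Streaming row by row keeps memory use flat, which matters for the full file. The `qid` value is kept as a string and may be empty for articles without a Wikidata item.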
Provider:
figshare



