Wikipedia Article Topics for All Languages (based on article outlinks)
Source: DataCite Commons. Updated: 2021-07-20. Indexed: 2024-07-28.
Download link:
https://figshare.com/articles/dataset/Wikipedia_Article_Topics_for_All_Languages_based_on_article_outlinks_/12619766/2
Description:
This dataset contains the predicted topic(s) for almost every Wikipedia article in every language edition. Articles without any valid outlinks (i.e., links to other Wikipedia articles) are not included.

The data is bzip-compressed. Each row is tab-delimited and contains the metadata fields listed below, followed by the predicted probability (rounded to three decimal places to reduce file size) that each topic in the following taxonomy applies to the article: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy

* wiki_db: the Wikipedia language edition the article belongs to, e.g., enwiki == English Wikipedia
* qid: the ID of the article's Wikidata item, if it has one, e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
* pid: the page ID of the article, e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
* num_outlinks: the number of Wikipedia links in the article that the model used to make its prediction. This count is taken after removing links to non-article namespaces (e.g., categories, templates), links to articles without Wikidata IDs (very few), and interwiki links, so that only links to namespace 0 articles in the same wiki with associated Wikidata IDs remain. It is provided mainly to give a sense of how much data the prediction is based on.

For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance
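
The row layout described above can be read with a short script. The following is a minimal Python sketch, not an official loader. It assumes the file is bzip2-compressed UTF-8 text, that the four metadata fields come first in the order listed, and that the caller supplies the topic names in the same order as the probability columns. The function name `iter_topic_rows`, the `threshold` parameter, and the header-row check are illustrative assumptions; the description does not say whether the file has a header row.

```python
import bz2
import csv
import io

# Metadata fields, in the order given in the dataset description.
META_COLS = ["wiki_db", "qid", "pid", "num_outlinks"]


def iter_topic_rows(fileobj, topic_names, threshold=0.5):
    """Yield (metadata, topics) pairs from a tab-delimited text stream.

    metadata maps each name in META_COLS to its raw string value.
    topics maps topic name -> probability, keeping only probabilities
    at or above `threshold`.

    Assumption: `topic_names` lists the taxonomy topics in the same
    order as the probability columns in the file.
    """
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, and a header row if the file has one (assumption).
        if not row or row[0] == "wiki_db":
            continue
        meta = dict(zip(META_COLS, row[: len(META_COLS)]))
        probs = zip(topic_names, row[len(META_COLS):])
        topics = {name: float(p) for name, p in probs if float(p) >= threshold}
        yield meta, topics


def open_topics_file(path):
    """Open a bzip2-compressed file as UTF-8 text (compression format assumed)."""
    return bz2.open(path, "rt", encoding="utf-8", newline="")
```

To use it, pass a handle from `open_topics_file` on the downloaded file (the local filename is whatever you saved it as) together with the topic list from the taxonomy page, then iterate over the results. Streaming row by row avoids loading the whole file into memory, which matters for a dataset covering all language editions.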
Provider:
figshare
Created:
2021-02-03



