
Wikipedia Articles and Associated WikiProject Templates

Source: DataCite Commons. Updated 2025-06-01. Indexed 2024-07-28.
Download link:
https://figshare.com/articles/Wikipedia_Articles_and_Associated_WikiProject_Templates/10248344/4
Description:
== wikiproject_to_template.halfak_20191202.yaml ==
Maps the canonical name of each WikiProject to every template that might be used to tag an article with that WikiProject. This is the mapping that was used to generate this dump. For instance, the line 'WikiProject Trade: ["WikiProject Trade", "WikiProject trade", "Wptrade"]' indicates that WikiProject Trade (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Trade) is associated with the following templates:
* https://en.wikipedia.org/wiki/Template:WikiProject_Trade
* https://en.wikipedia.org/wiki/Template:WikiProject_trade
* https://en.wikipedia.org/wiki/Template:Wptrade

== wikiproject_taxonomy.halfak_20191202.yaml ==
A proposed mapping of WikiProjects to higher-level categories, based on the WikiProjects' canonical names. This mapping has not been applied to the JSON dump contained here.

== gather_wikiprojects_per_article.py ==
Old Python script that built the JSON dump described below for English Wikipedia from wikitext/Wikidata dumps (slow and more prone to errors).

== gather_wikiprojects_per_article_pageassessments.py ==
New Python script that builds the JSON dump described below from the PageAssessments MediaWiki table in MariaDB. It is much faster and handles languages beyond English much more easily.

== labeled_wiki_with_topics_metadata.json.bz2 ==
Each line of this bzipped JSON file corresponds to a Wikipedia article in that language (currently Arabic, English, French, Hungarian, Turkish). The intended usage of this JSON file is to build topic classification models for Wikipedia articles.

While the English file has good coverage because a more or less complete mapping exists between WikiProjects and topics, the other languages are much sparser in their labels because they do not cover any WikiProjects in that language that lack English equivalents (per Wikidata). The other languages are probably best used to supplement the English labels or as a separate test set that might have a different topic distribution.

The following properties are recorded:
* title: Wikipedia article title in that language
* article_revid: Most recent revision ID associated with the article for which a WikiProject assessment was made (might not be the current revision ID)
* talk_pid: Page ID corresponding to the talk page for the Wikipedia article
* talk_revid: Most recent revision ID associated with the talk page for which a WikiProject assessment was made (might not be the current revision ID)
* wp_templates: List of WikiProject templates from the page_assessments table
* qid: Wikidata ID corresponding to the Wikipedia article
* sitelinks: Based on Wikidata, the other languages in which this article exists and the corresponding page IDs
* topics: Topic labels associated with the article based on its WikiProject templates and the WikiProject-to-label mapping (wikiproject_taxonomy)

This version is based on the 24 May 2020 page_assessments tables and the 4 May 2020 Wikidata item_page_link table. Articles with no associated WikiProject templates are not included. Three things differ from previous versions of this file:
* The revision IDs are now the revision IDs that were most recently assessed by a WikiProject, not the current versions of the page.
* The sitelinks are now given as page IDs, which are more stable and less prone to encoding issues.
* The WikiProject templates are now pulled via the MediaWiki page_assessments table and so are in a different format than the templates that were extracted from the raw talk pages.

For example, here is the line for Agatha Christie from the English JSON file:

    {
     'title': 'Agatha_Christie',
     'article_revid': 958377791,
     'talk_pid': 1001,
     'talk_revid': 958103309,
     'wp_templates': [
       "Women",
       "Women's History",
       "Women writers",
       "Biography",
       "Novels/Crime task force",
       "Novels",
       "Biography/science and academia work group",
       "Biography/arts and entertainment work group",
       "Devon",
       "Archaeology/Women in archaeology task force",
       "Archaeology"],
     'qid': 'Q35064',
     'sitelinks': {
       'afwiki': 19274,
       'amwiki': 47582,
       'anwiki': 115127,
       'arwiki': 12886,
       ...
       'enwiki': 984,
       ...
       'zhwiki': 10983,
       'zh_min_nanwiki': 21828,
       'zh_yuewiki': 131652}
    }
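Because the dump stores one article per line, it can be streamed without decompressing the whole file first. The following is a minimal sketch, not the authors' own loading code. It assumes each line is valid JSON with double-quoted keys (the example line above is displayed in Python dict style), and the helper names `iter_articles` and `count_topics` are hypothetical.

```python
import bz2
import json
from collections import Counter


def iter_articles(path):
    """Yield one dict per non-empty line of a bzip2-compressed JSON-lines file."""
    # Text mode lets us iterate line by line without loading the whole file.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_topics(articles):
    """Count how often each topic label appears across the given articles."""
    counts = Counter()
    for article in articles:
        # Assumption: 'topics' is a list of strings; treat a missing key as no labels.
        counts.update(article.get("topics", []))
    return counts
```

A call such as `count_topics(iter_articles("labeled_wiki_with_topics_metadata.json.bz2"))` would then give a rough view of the label distribution, which is useful for checking how sparse the non-English files are before training a classifier. The path shown is a placeholder for wherever the file was downloaded.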

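The wikiproject_to_template mapping runs from a canonical WikiProject name to its template names, while tagging an article usually needs the reverse lookup. The sketch below inverts an already-loaded mapping and rebuilds the template URLs in the style of the WikiProject Trade example. It is illustrative only: the function names are hypothetical, and it assumes the YAML file has been parsed into a plain dict of lists (for instance with PyYAML's `yaml.safe_load`).

```python
def invert_template_mapping(project_to_templates):
    """Map each template name to the canonical WikiProject name that lists it."""
    template_to_project = {}
    for project, templates in project_to_templates.items():
        for template in templates:
            # If a template name is listed under two projects, keep the first one seen.
            template_to_project.setdefault(template, project)
    return template_to_project


def template_url(template_name):
    """Build the English Wikipedia URL for a template, as in the Trade example."""
    return "https://en.wikipedia.org/wiki/Template:" + template_name.replace(" ", "_")


# The single entry quoted in the description of the YAML file.
mapping = {"WikiProject Trade": ["WikiProject Trade", "WikiProject trade", "Wptrade"]}
lookup = invert_template_mapping(mapping)
print(lookup["Wptrade"])                   # WikiProject Trade
print(template_url("WikiProject trade"))   # https://en.wikipedia.org/wiki/Template:WikiProject_trade
```

Keeping the first project on a collision is an arbitrary choice made for this sketch. Whether the real file contains such collisions is not stated in the description.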
Provider:
figshare
Created:
2020-06-08