
Reference usage and in-page reuse, all Wikimedia wikis, snapshot 2024-05-01

DataCite Commons. Updated 2025-06-01, indexed 2024-08-19.
Download link:
https://figshare.com/articles/dataset/Reference_usage_and_in-page_reuse_all_Wikimedia_wikis_snapshot_2024-05-01/26003965/1
Description:
Overview

This data was produced by Wikimedia Germany's Technical Wishes team and focuses on usage statistics for reference footnotes made using the Cite extension, across Main-namespace pages (articles) on nearly all Wikimedia sites. It was produced by processing the Wikimedia Enterprise HTML dumps.

Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other research. Our specific goal was to understand the potential for improving the ways in which references can be reused within a page.

Reference tags are frequently used in conjunction with wikitext templates, which makes them challenging to parse. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.

We didn't look at reuse across pages for this analysis.

License

All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/

The source code is distributed under BSD-3-Clause.

Source code and pluggable framework

The dumps were processed by HTML dump scraper v0.3.1, written in the Elixir language.

The job was run on the Wikimedia Analytics Cluster to take advantage of its high-speed access to HTML dumps. Production configuration is included in the source code repository, and the command line used to run it was: "MIX_ENV=prod mix run pipeline.exs".

Our team plans to continue development of the scraper to support future projects as well.

Suggestions for new or improved analysis units are welcome.

Data format

Files are provided at several levels of granularity, from per-page and per-wiki analysis through all-wikis comparisons.

Files are either ND-JSON (newline-delimited JSON), plain JSON or CSV.

Column definitions

Columns are documented in metrics.md.

Page summaries

Fine-grained results in which each line represents the summarization of a single wiki page.

Example file name: enwiki-20240501-page-summary.ndjson.gz

Example metrics found in these files:

How many tags are created from templates vs. directly in the article.
How many references contain a template transclusion to produce their content.
How many references are unnamed, automatically named, or manually named.
How often references are reused via their name.
Copy-pasted references that share the same or almost the same content, on the same page.
Whether an article has more than one references list.

Wiki summaries

Page analyses are rolled up to the wiki level, in a separate file for each wiki.

Example file name: enwiki-20240501-summary.json

Top-level comparison

Summarized statistics for each wiki are collected into a single file.

Non-scalar fields are discarded for now and various aggregations are used, as can be seen from the aggregated column name suffixes.

File name: all-wikis-20240501-summary.csv

Error count comparison

We're also collecting a total count of different Cite errors for each wiki.

File name: all-wikis-20240501-cite-error-summary.csv

Environmental costs

There were several rounds of experimentation and mistakes, so the costs below should be multiplied by 3-4.

The computation took 4.5 days on 24 vCPUs sharing 2 GB of memory at a data center in Virginia, US. Estimating the environmental impact through https://www.green-algorithms.org/ we get an upper bound of 12.6 kg CO2e, or 40.8 kWh, or 72 km driven in a passenger car.

Disk usage was significant as well, with 827 GB read and 4 GB written. At the high estimate of 7 kWh/GB, this could have used as much as 5.8 MWh of energy, but likely much less since streaming was contained within one data center.
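The page summary files described under "Data format" are gzip-compressed ND-JSON, so they can be streamed one record at a time without loading a whole wiki into memory. The following is a minimal sketch, not part of the dataset's own tooling: the function names are hypothetical, and it assumes each line holds a single JSON object. It deliberately avoids naming any specific column, since the actual columns are defined in metrics.md.

```python
import gzip
import json
from collections import Counter
from typing import Iterator


def iter_page_summaries(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzipped ND-JSON file.

    Assumption: every line is a single JSON object (one wiki page).
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def count_fields(path: str) -> tuple[int, Counter]:
    """Return the number of records and how often each top-level key occurs."""
    total = 0
    keys: Counter = Counter()
    for record in iter_page_summaries(path):
        total += 1
        keys.update(record.keys())
    return total, keys
```

Calling `count_fields("enwiki-20240501-page-summary.ndjson.gz")` on a downloaded file would return the number of page records together with a tally of the top-level keys, which can then be compared against the column definitions in metrics.md.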
Provider: figshare
Created: 2024-06-10