Reference usage and in-page reuse, all Wikimedia wikis, snapshot 2024-05-01
Source: DataCite Commons · updated 2024-06-10 · indexed 2024-08-19
Download link:
https://figshare.com/articles/dataset/Reference_usage_and_in-page_reuse_all_Wikimedia_wikis_snapshot_2024-05-01/26003965
Resource description:
### Overview
This data was produced by Wikimedia Germany’s Technical Wishes team, and focuses on usage statistics for reference footnotes made using the Cite extension, across Main-namespace pages (articles) on nearly all Wikimedia sites. It was produced by processing the Wikimedia Enterprise HTML dumps.
Our analysis of references was inspired by “Characterizing Wikipedia Citation Usage” and other research. Our specific goal was to understand the potential for improving the ways in which references can be reused within a page.
Reference tags are frequently used in conjunction with wikitext templates, which makes them challenging to parse. For this reason, we decided to parse the rendered HTML pages rather than the original wikitext.
We didn’t look at reuse across pages for this analysis.
### License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/
The source code is distributed under BSD-3-Clause.
### Source code and pluggable framework
The dumps were processed by HTML dump scraper v0.3.1, written in the Elixir language.
The job was run on the Wikimedia Analytics Cluster to take advantage of its high-speed access to HTML dumps. Production configuration is included in the source code repository, and the command line used to run it was: `MIX_ENV=prod mix run pipeline.exs`.
Our team plans to continue development of the scraper to support future projects as well.
Suggestions for new or improved analysis units are welcome.
### Data format
Files are provided at several levels of granularity, from per-page and per-wiki analysis through all-wikis comparisons.
Files are either ND-JSON (newline-delimited JSON), plain JSON or CSV.
### Column definitions
Columns are documented in metrics.md.
### Page summaries
Fine-grained results in which each line represents the summarization of a single wiki page.
Example file name: enwiki-20240501-page-summary.ndjson.gz
Example metrics found in these files:
- How many tags are created from templates vs. directly in the article.
- How many references contain a template transclusion to produce their content.
- How many references are unnamed, automatically named, or manually named.
- How often references are reused via their name.
- Copy-pasted references that share the same or almost the same content, on the same page.
- Whether an article has more than one references list.
### Wiki summaries
Page analyses are rolled up to the wiki level, in a separate file for each wiki.
Example file name: enwiki-20240501-summary.json
### Top-level comparison
Summarized statistics for each wiki are collected into a single file.
Non-scalar fields are discarded for now and various aggregations are used, as can be seen from the aggregated column name suffixes.
File name: all-wikis-20240501-summary.csv
### Error count comparison
We’re also collecting a total count of different Cite errors for each wiki.
File name: all-wikis-20240501-cite-error-summary.csv
### Environmental costs
There were several rounds of experimentation and mistakes, so the costs below should be multiplied by 3-4.
The computation took 4.5 days on 24 vCPUs sharing 2 GB of memory at a data center in Virginia, US. Estimating the environmental impact through https://www.green-algorithms.org/ we get an upper bound of 12.6 kg CO2e, or 40.8 kWh, or 72 km driven in a passenger car.
Disk usage was significant as well, with 827 GB read and 4 GB written. At the high estimate of 7 kWh/GB, this could have used as much as 5.8 MWh of energy, but likely much less since streaming was contained within one data center.
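### Dataset overview
This dataset was produced by the Technical Wishes team at Wikimedia Germany. It covers usage statistics for reference footnotes generated with the Cite extension on Main-namespace pages (articles) across nearly all Wikimedia sites. It was generated by processing the Wikimedia Enterprise HTML dumps.
Our analysis of references was inspired by "Characterizing Wikipedia Citation Usage" and other related research. The core goal was to explore how reference reuse within a page could be improved.
Reference tags are often combined with wikitext templates, which makes them considerably harder to parse. For that reason, we parsed the rendered HTML pages rather than the original wikitext. Reuse across pages was not examined in this analysis.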
### License
All files in this dataset are released under CC0 (Creative Commons Zero); see https://creativecommons.org/publicdomain/zero/1.0/
The source code is distributed under the BSD-3-Clause license.
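### Source code and pluggable framework
The dumps were processed with HTML dump scraper v0.3.1, which is written in Elixir. The job ran on the Wikimedia Analytics Cluster to take advantage of its high-speed access to the HTML dumps. The production configuration is included in the source code repository, and the command used for the run was: `MIX_ENV=prod mix run pipeline.exs`.
Our team plans to keep developing the scraper to support future projects. Suggestions for new or improved analysis units are welcome.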
### Data format
Files are provided at several levels of granularity, from per-page and per-wiki analyses through to all-wikis comparisons. Each file is ND-JSON (newline-delimited JSON), plain JSON, or CSV.
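As a rough illustration of the three formats listed above, the following sketch picks a parser from the file name suffix. The helper name `load_records` is hypothetical, and the assumption that the CSV files start with a header row is not confirmed by the dataset description.

```python
import csv
import gzip
import json
from pathlib import Path


def load_records(path):
    """Return a list of records from an ND-JSON, JSON, or CSV file.

    Hypothetical helper: the suffix-based dispatch is an assumption
    made for illustration, not part of the published tooling.
    """
    path = Path(path)
    name = path.name
    # Transparently handle gzip-compressed files such as *.ndjson.gz.
    opener = gzip.open if name.endswith(".gz") else open
    base = name[:-3] if name.endswith(".gz") else name

    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        if base.endswith(".ndjson"):
            # One JSON document per non-empty line.
            return [json.loads(line) for line in handle if line.strip()]
        if base.endswith(".json"):
            data = json.load(handle)
            return data if isinstance(data, list) else [data]
        if base.endswith(".csv"):
            # Assumes the first row holds the column names.
            return list(csv.DictReader(handle))
    raise ValueError(f"Unrecognized file type: {name}")
```

Loading a whole file into a list is convenient for the per-wiki JSON and the all-wikis CSV files; the per-page ND-JSON files for large wikis are probably better read line by line.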
### Column definitions
Columns are documented in metrics.md.
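### Page summaries
Fine-grained results: each line summarizes a single wiki page.
Example file name: enwiki-20240501-page-summary.ndjson.gz
Example metrics found in these files:
- How many reference tags are created from templates versus written directly in the article
- How many references contain a template transclusion to produce their content
- How many references are unnamed, automatically named, or manually named
- How often references are reused via their name
- Copy-pasted references on the same page that share identical or nearly identical content
- Whether an article has more than one references list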
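Because each line of a page-summary file is an independent JSON document, such a file can be streamed without loading it fully into memory. The sketch below totals one numeric field across all pages. The function name `sum_field` is hypothetical, and the field name is passed in by the caller because the real column names are defined in metrics.md and are not reproduced here.

```python
import gzip
import json


def sum_field(path, field):
    """Stream a gzip-compressed ND-JSON file and total one numeric field.

    Returns (page_count, total). Pages lacking the field count as zero.
    """
    pages = 0
    total = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            pages += 1
            total += record.get(field, 0)
    return pages, total
```

A call might look like `sum_field("enwiki-20240501-page-summary.ndjson.gz", "ref_count")`, where `ref_count` is only a placeholder for whichever column metrics.md actually documents.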
### Wiki summaries
Page analyses are rolled up to the wiki level, with a separate file for each wiki.
Example file name: enwiki-20240501-summary.json
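To make the idea of a roll-up concrete, here is a minimal sketch that sums every scalar field across a list of page-level records. It is not the scraper's actual aggregation, which is not described here. The function name `roll_up`, the added `page_count` key, and the choice of summation are all assumptions for illustration.

```python
from collections import Counter


def roll_up(page_records):
    """Sum every numeric (or boolean) field across page-level records.

    Non-scalar values such as lists and dicts are skipped. Booleans are
    counted as 0/1, so the total is the number of pages where the flag
    is true.
    """
    totals = Counter()
    for record in page_records:
        for key, value in record.items():
            if isinstance(value, (bool, int, float)):
                totals[key] += value
    summary = dict(totals)
    # Hypothetical extra key recording how many pages were aggregated.
    summary["page_count"] = len(page_records)
    return summary
```

Summing a boolean such as "has more than one references list" gives a per-wiki page count for that condition, which is one plausible way a page-level flag could become a wiki-level number.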
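### Top-level comparison
Summary statistics for each wiki are collected into a single file. Non-scalar fields are discarded for now, and various aggregations are applied, as indicated by the suffixes of the aggregated column names.
File name: all-wikis-20240501-summary.csv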
### Error count comparison
We also collect a total count of the different Cite errors for each wiki.
File name: all-wikis-20240501-cite-error-summary.csv
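### Environmental costs
The work went through several rounds of experimentation and mistakes, so the costs below should be multiplied by 3-4.
The computation ran for 4.5 days on 24 vCPUs sharing 2 GB of memory, at a data center in Virginia, US. Estimating the environmental impact with https://www.green-algorithms.org/ gives an upper bound of 12.6 kg CO2e, or 40.8 kWh, equivalent to driving 72 km in a passenger car.
Disk usage was significant as well: 827 GB read and 4 GB written. At the high estimate of 7 kWh/GB, this could have used as much as 5.8 MWh of energy, though the real figure is likely much lower because the data streaming stayed within one data center.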
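The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the numbers quoted above. It assumes that the read and written volumes are added before applying the per-gigabyte estimate (the result rounds to 5.8 MWh either way), and it applies the 3-4x multiplier to the compute figures only, purely as an illustration.

```python
# Figures quoted in the section above.
COMPUTE_KWH = 40.8        # energy estimate for one run
COMPUTE_KG_CO2E = 12.6    # emissions estimate for one run
DISK_READ_GB = 827
DISK_WRITTEN_GB = 4
HIGH_KWH_PER_GB = 7       # high per-gigabyte estimate
RERUN_FACTORS = (3, 4)    # multiplier for experimentation and mistakes

# Assumption: read and written volumes are added before scaling.
disk_kwh = (DISK_READ_GB + DISK_WRITTEN_GB) * HIGH_KWH_PER_GB
disk_mwh = disk_kwh / 1000

# Scale the single-run compute figures by the 3-4x multiplier.
kwh_low, kwh_high = (COMPUTE_KWH * f for f in RERUN_FACTORS)
co2_low, co2_high = (COMPUTE_KG_CO2E * f for f in RERUN_FACTORS)

print(f"Disk upper bound: {disk_mwh:.1f} MWh")
print(f"Compute energy with reruns: {kwh_low:.1f} to {kwh_high:.1f} kWh")
print(f"Compute emissions with reruns: {co2_low:.1f} to {co2_high:.1f} kg CO2e")
```

With the multiplier applied, the compute estimate comes to roughly 122 to 163 kWh and 38 to 50 kg CO2e.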
Provider: figshare
Created: 2024-06-10



