
TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia

Source: NIAID Data Ecosystem, indexed 2026-03-11
Download link:
https://zenodo.org/record/789289
Description:
Fixes in version 1.1 (= Zenodo's "version 2"):
* In 20161101-revisions-part1-12-1728.csv, the missing first data line has been added.
* In the Current_content and Deleted_content files, some token values ('str' column) that contain regular quotes ('"') have been fixed.
* In the Current_content and Deleted_content files, some wrong revision ID values in the 'origin_rev_id', 'in' and 'out' columns have been fixed.

------

This dataset contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists of all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling complete and straightforward tracking of its history.

This data would be exceedingly hard for an average potential user to create, because (i) it is very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task.

Adapting a state-of-the-art algorithm, we have produced a dataset that allows a range of analyses and metrics, both those already popular in research and others going beyond them, to be generated at complete-Wikipedia scale. It ensures quality and allows researchers to forgo the expensive text-comparison computation that has so far hindered scalable usage.

This dataset, its creation process and its use cases are described in a dedicated dataset paper of the same name, published at the ICWSM 2017 conference. In that paper, we show how this data enables, at token level, the computation of provenance, the measurement of content survival over time, very detailed conflict metrics, and the analysis of fine-grained editor interactions such as partial reverts and re-additions, among other metrics.
Tokenization used: https://gist.github.com/faflo/3f5f30b1224c38b1836d63fa05d1ac94

Toy example for how the token metadata is generated: https://gist.github.com/faflo/8bd212e81e594676f8d002b175b79de8

Be sure to read the ReadMe.txt or, for even more detail, the supporting paper referenced under "related identifiers".
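
To make the per-token annotation more concrete, the following is a minimal sketch of how one token's history might be reconstructed from its metadata. It is not the authors' code. The column names 'str', 'origin_rev_id', 'in' and 'out' come from the description above; the `TokenRecord` class, the helper functions, and the revision IDs are hypothetical. The sketch also assumes that 'out' lists the revisions in which the token was deleted, that 'in' lists the revisions in which it was re-added, and that the i-th deletion precedes the i-th re-addition. Check the ReadMe.txt for the actual file layout before relying on any of this.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TokenRecord:
    """Hypothetical in-memory form of one row of token metadata."""
    text: str                                       # 'str' column
    origin_rev_id: int                              # revision that created the token
    ins: List[int] = field(default_factory=list)    # 'in' column (assumed: re-additions)
    outs: List[int] = field(default_factory=list)   # 'out' column (assumed: deletions)


def history(token: TokenRecord) -> List[Tuple[int, str]]:
    """Interleave creation, deletions and re-additions into one event list.

    Assumes the i-th entry of 'out' happens before the i-th entry of 'in'.
    """
    events = [(token.origin_rev_id, "created")]
    for i, out_rev in enumerate(token.outs):
        events.append((out_rev, "deleted"))
        if i < len(token.ins):
            events.append((token.ins[i], "re-added"))
    return events


def is_currently_present(token: TokenRecord) -> bool:
    """A token is present if every deletion was followed by a re-addition."""
    return len(token.outs) == len(token.ins)


# Made-up revision IDs, for illustration only.
example = TokenRecord(text="wiki", origin_rev_id=100, ins=[130], outs=[120, 150])
example_history = history(example)
```

Under these assumptions, a token with more deletions than re-additions would be expected in a Deleted_content file, and one with equal counts in a Current_content file.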
Created:
2020-01-24