Wikipedia Clickstream
DataCite Commons | Updated 2020-09-04 | Indexed 2024-07-25
Download link:
https://figshare.com/articles/dataset/Wikipedia_Clickstream/1305770/5
Description:
THIS IS STILL WIP, PLEASE DO NOT CIRCULATE

About
This dataset contains counts of (referer, article) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included in the request in an HTTP header called the "referer". This data captures 22 million (referer, article) pairs from a total of 4 billion requests collected during January 2015.

Data Preparation
- The dataset only includes requests to articles in the main namespace of the desktop version of English Wikipedia (see https://en.wikipedia.org/wiki/Wikipedia:Namespace)
- Requests to MediaWiki redirects are excluded
- Spider traffic was excluded using the ua-parser library (https://github.com/tobie/ua-parser)
- Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources of English Wikipedia, based on this scheme:
  - an article in the main namespace of English Wikipedia -> the article title
  - any Wikipedia page that is not in the main namespace of English Wikipedia -> 'other-wikipedia'
  - an empty referer -> 'other-empty'
  - a page from any other Wikimedia project -> 'other-internal'
  - Google -> 'other-google'
  - Yahoo -> 'other-yahoo'
  - Bing -> 'other-bing'
  - Facebook -> 'other-facebook'
  - Twitter -> 'other-twitter'
  - anything else -> 'other'
  For the exact mapping see https://github.com/ewulczyn/wmf/blob/master/mc/oozie/hive_query.sql#L30-L48
- (referer, article) pairs with 10 or fewer observations were removed from the dataset

Note: When a user requests a page through the search bar, the page the user searched from is listed as the referer. Hence, the data contains (referer, article) pairs for which the referer does not contain a link to the article. For example, consider the (Wikipedia, Chris_Kyle) pair: users went to the 'Wikipedia' article to search for Chris Kyle within English Wikipedia.

Applications
This data can be used for various purposes:
- determining the most frequent links people click on for a given article
- determining the most common links people followed to an article
- determining what share of the total traffic to an article went on to click a link in that article
- generating a Markov chain over English Wikipedia

Format
- prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value is empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer, i.e. the previous article the client was on
- curr_id: the unique MediaWiki page ID of the article the client requested
- n: the number of occurrences of the (referer, article) pair
- prev_title: the result of mapping the referer URL to the fixed set of values described above
- curr_title: the title of the article the client requested

License
All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/

Source code
https://github.com/ewulczyn/wmf/blob/master/mc/oozie/hive_query.sql (MIT license)
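
Example
The uses listed under Applications can be sketched in a few lines of Python. This is a minimal illustration, not the dataset authors' code. It assumes the files are tab-separated with a header row naming the five columns described under Format; the description does not state the delimiter. The sample rows, their counts, and all function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows with made-up counts, for illustration only.
# Assumption: tab-separated values with a header row naming the columns.
SAMPLE = (
    "prev_id\tcurr_id\tn\tprev_title\tcurr_title\n"
    "\t200\t500\tother-google\tB\n"
    "100\t200\t300\tA\tB\n"
    "100\t300\t100\tA\tC\n"
    "\t100\t1000\tother-empty\tA\n"
)


def read_pairs(handle):
    """Yield (prev_title, curr_title, n) tuples from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["prev_title"], row["curr_title"], int(row["n"])


def outgoing_counts(pairs):
    """Map each referer to a dict of {requested article: count}."""
    out = defaultdict(dict)
    for prev, curr, n in pairs:
        out[prev][curr] = out[prev].get(curr, 0) + n
    return out


def incoming_totals(pairs):
    """Map each requested article to its total count over all referers."""
    totals = defaultdict(int)
    for _prev, curr, n in pairs:
        totals[curr] += n
    return totals


def top_links(out, title, k=5):
    """Return the k most frequent next articles for a given referer."""
    ranked = sorted(out.get(title, {}).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


def transition_probs(out, title):
    """Return one row of a first-order Markov chain: P(next article | title)."""
    targets = out.get(title, {})
    total = sum(targets.values())
    return {curr: n / total for curr, n in targets.items()} if total else {}


def click_through(out, totals, title):
    """Share of recorded requests to `title` that continued to another article."""
    requests = totals.get(title, 0)
    clicks = sum(out.get(title, {}).values())
    return clicks / requests if requests else 0.0


if __name__ == "__main__":
    pairs = list(read_pairs(io.StringIO(SAMPLE)))
    out = outgoing_counts(pairs)
    totals = incoming_totals(pairs)
    print(top_links(out, "A"))
    print(transition_probs(out, "A"))
    print(click_through(out, totals, "A"))
```

Because pairs with 10 or fewer observations were removed and search-bar navigation is recorded as a referer, the ratios computed this way are approximations of link-following behaviour rather than exact click rates.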
Provider:
figshare
Created:
2016-01-19



