External References of English Wikipedia (ref-wiki-en)
Source: NIAID Data Ecosystem. Indexed 2026-03-12.
Download link:
https://zenodo.org/record/4001138
Description:
External References of English Wikipedia (ref-wiki-en) is a corpus of the plain-text content of 2,475,461 external webpages linked from the reference section of articles in English Wikipedia. Specifically:
32,329,989 external reference URLs were extracted from a 2018 HTML dump of English Wikipedia. Removing repeated and ill-formed URLs yielded 23,036,318 unique URLs.
URLs whose file extensions indicate unsupported formats (videos, audio, etc.) were then filtered out, yielding 17,781,974 downloadable URLs. These URLs were loaded into Apache Nutch and downloaded continuously from August 2019 to December 2019, resulting in 2,475,461 successfully downloaded webpages. Not all URLs could be accessed. The order in which URLs were accessed was determined by Nutch, which partitions URLs by host and then chooses randomly among the URLs for each host.
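The deduplication and extension filtering described above can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the extension list, the well-formedness rule, and the function names are all assumptions, since the description only mentions "videos, audio, etc.".

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical list: the description does not enumerate the unsupported formats.
UNSUPPORTED_EXTENSIONS = {".mp4", ".avi", ".mov", ".mp3", ".wav", ".ogg"}


def is_well_formed(url: str) -> bool:
    """Assumed rule: keep only absolute http(s) URLs that name a host."""
    try:
        parts = urlparse(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_downloadable(url: str) -> bool:
    """Reject URLs whose path ends in an unsupported file extension."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension not in UNSUPPORTED_EXTENSIONS


def filter_urls(raw_urls):
    """Drop repeated and ill-formed URLs, then drop unsupported formats."""
    seen = set()
    unique = []
    for url in raw_urls:
        url = url.strip()
        if url in seen or not is_well_formed(url):
            continue
        seen.add(url)
        unique.append(url)
    return [url for url in unique if is_downloadable(url)]
```

The two stages are kept separate to mirror the two counts reported above: unique URLs after the first stage, downloadable URLs after the second.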
The content of these webpages was indexed in Apache Solr by Nutch. From Solr, we extracted a JSON dump of the content.
Many URLs redirect to another address; unfortunately, Nutch does not index redirect information. This complicated the task of connecting each Wikipedia article (which holds the pre-redirect link) to the downloaded webpage (stored under the post-redirect link). However, by inspecting the order of download in the Nutch log files, we managed to recover the links from 2,058,896 documents (83%) to their original Wikipedia article(s).
We further managed to associate 3,899,953 unique Wikidata items with at least one external reference webpage in the corpus.
The ref-wiki-en corpus is incomplete, i.e., we did not attempt to download all reference URLs for English Wikipedia. We thus also collected a smaller, complete corpus for the external references of 5,000 Wikipedia articles (ref-wiki-en-5k). We sampled from 5 ranges of Wikidata items: Q1-10000, Q10001-100000, Q100001-1000000, Q1000001-10000000, and Q10000001-100000000. From each range we sampled 1,000 items. We then scraped the external reference URLs from the Wikipedia articles corresponding to these items and downloaded them. The resulting corpus contains 37,983 webpages.
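The stratified sampling over the five Q-identifier ranges could look something like the sketch below. It is a simplification under stated assumptions: it draws numeric identifiers uniformly from each range, whereas an actual pipeline would presumably need to restrict the draw to items that exist and have an English Wikipedia article. The names `Q_RANGES` and `sample_q_ids`, and the use of a fixed seed, are hypothetical.

```python
import random

# The five ranges named in the description (inclusive bounds).
Q_RANGES = [
    (1, 10_000),
    (10_001, 100_000),
    (100_001, 1_000_000),
    (1_000_001, 10_000_000),
    (10_000_001, 100_000_000),
]


def sample_q_ids(per_range: int = 1000, seed: int = 0):
    """Draw `per_range` distinct Q identifiers from each range."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sampled = []
    for low, high in Q_RANGES:
        # Sampling from a range object avoids materialising millions of integers.
        numbers = rng.sample(range(low, high + 1), per_range)
        sampled.extend(f"Q{number}" for number in numbers)
    return sampled
```

Because the ranges grow by an order of magnitude each, equal-sized samples over-represent low-numbered items relative to uniform sampling over all identifiers.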
Each line of the corpus (ref-wiki-en, ref-wiki-en-5k) encodes the webpage of an external reference in JSON format. Specifically, we provide:
tstamp: When the webpage was accessed.
host: The domain (FQDN post-redirect) from which the webpage was retrieved.
title: The title (meta) of the document.
url: The URL (post-redirect) of the webpage.
Q: The Q-code identifiers of the Wikidata items whose corresponding Wikipedia article is confirmed to link to this webpage.
content: A plain-text encoding of the content of the webpage.
Below we provide an abbreviated example of a line from the corpus:
{"tstamp":"2019-09-26T01:22:43.621Z","host":"geology.isu.edu","title":"Digital Geology of Idaho - Basin And Range","url":"http://geology.isu.edu/Digital_Geology_Idaho/Module9/mod9.htm","Q":[810178],"content":"Digital Geology of Idaho - Basin And Range\n1 - Idaho Basement Rock\n2 - Belt Supergroup\n3 - Rifting & Passive Margin\n4 - Accreted Terranes\n5 - Thrust Belt\n6 - Idaho Batholith\n7 - North Idaho & Mining\n8 - Challis Volcanics\n9 - Basin and Range\n10 - Columbia River Basalts\n11 - SRP & Yellowstone\n12 - Pleistocene Glaciation\n13 - Palouse & Lake Missoula\n14 - Lake Bonneville Flood\n15 - Snake River Plain Aquifer\nBasin and Range Province - Teritiary Extension\nGeneral geology of the Basin and Range Province\nMechanisms of Basin and Range faulting\nIdaho Basin and Range south of the Snake River Plain\nIdaho Basin and Range north of the Snake River Plain\nLocal areas of active and recent Basin & Range faulting: Borah Peak\nPDF Slideshows: North of SRP , South of SRP , Borah Earthquake\nFlythroughs: Teton Valley , Henry's Fork , Big Lost River , Blackfoot , Portneuf , Raft River Valley , Bear River , Salmon Falls Creek , Snake River , Big Wood River\nVocabulary Words\nthrust fault\nBasin and Range\nSnake River Plain\nhalf-graben\ntransfer zone\n \n \n \n \nFly-throughs\nGeneral geology of the Basin and Range Province\nThe Basin and Range Province generally includes most of eastern California, eastern Oregon, eastern Washington, Nevada, western Utah, southern and western Arizona, and southeastern Idaho. ..."},
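As a minimal sketch of how such lines might be consumed, the following reads a gzipped corpus file and groups webpage URLs by Wikidata item. It assumes one JSON object per line, possibly followed by a trailing comma as in the example above, and that `Q` holds integers; the function names `parse_line` and `index_by_q` are hypothetical, and the actual file layout may differ.

```python
import gzip
import json
from collections import defaultdict


def parse_line(line: str):
    """Parse one corpus line, tolerating a trailing comma; None for blank lines."""
    text = line.strip().rstrip(",")
    if not text:
        return None
    return json.loads(text)


def index_by_q(path: str):
    """Map each Wikidata Q identifier to the URLs of its reference webpages."""
    index = defaultdict(list)
    # Stream line by line: the full corpus holds millions of webpages.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = parse_line(line)
            if record is None:
                continue
            # Webpages with no confirmed Wikipedia article may lack Q values.
            for q in record.get("Q", []):
                index[f"Q{q}"].append(record["url"])
    return index
```

For example, `index_by_q("ref-wiki-en-5k.json.gz")` would be one way to look up candidate reference pages for a given item, under the assumptions stated above.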
A summary of the files we make available:
ref-wiki-en.json.gz: 2,475,461 external reference webpages (JSON format)
ref-wiki-en_urls.txt.gz: 23,036,318 unique raw links to external references (plain-text format)
ref-wiki-en-5k.json.gz: 37,983 external reference webpages (JSON format)
ref-wiki-en-5k_urls.json.gz: 70,375 unique raw links to external references (plain-text format)
ref-wiki-en-5k_Q.txt.gz: 5,000 Wikidata Q identifiers forming the 5k dataset (plain-text format)
Further details can be found in the publication:
Suggesting References for Wikidata Claims based on Wikipedia's External References. Paolo Curotto, Aidan Hogan. Wikidata Workshop @ISWC 2020.
Further material relating to this publication (including code for a proof-of-concept interface) is also available.
Created:
2021-02-19



