Crawled URL Index - JISC UK Web Domain Dataset (1996-2013)
DataCite Commons. Updated 2020-07-23. Indexed 2025-04-09.
Download link:
https://bl.iro.bl.uk/work/3c39a755-5e3d-405b-9944-b13e76a87ad8
Resource description:
The dataset comprises the original compound index (CDX) files, re-assembled into 18 separate CDX files, one for each year of crawling activity represented (1996 - 2013). Please note that the individual CDX files are not sorted. To enable access to web archives, UKWA uses CDX files as indexes, making it possible to look up which ARC or WARC files contain which URLs and responses. In partnership with the Internet Archive and JISC, UKWA obtained access to the subset of the Internet Archive’s web collection that relates to the UK. The JISC UK Web Domain Dataset (1996 - 2013) contains all of the resources from the Internet Archive that were hosted on domains ending in ‘.uk’, or that are required in order to render those UK pages. For more information: http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/
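The lookup role of a CDX index described above can be sketched in Python. This is a minimal illustration, not the UKWA tooling. It assumes a common whitespace-separated CDX layout in which the first field is the URL key, the second is the capture timestamp, and the last two fields are the byte offset and the ARC/WARC filename; the actual field order of these files should be checked against the header line of each file. The sample lines, file names, and function names are hypothetical. Because the files are not sorted, the sketch builds an in-memory map rather than using binary search, which is only practical for a small extract.

```python
from collections import defaultdict

# Hypothetical sample lines (not taken from the dataset), deliberately unsorted.
SAMPLE_LINES = [
    " CDX N b a m s k r M S V g",
    "uk,co,example)/ 20050101120000 http://example.co.uk/ text/html 200 AAAA - - 900 4321 example-0002.arc.gz",
    "uk,co,example)/ 20040101120000 http://example.co.uk/ text/html 200 BBBB - - 1234 5678 example-0001.arc.gz",
    "uk,ac,example)/about 20060101120000 http://example.ac.uk/about text/html 200 CCCC - - 800 42 example-0003.arc.gz",
]


def parse_cdx_line(line):
    """Return (urlkey, timestamp, filename, offset), or None for header/short lines."""
    parts = line.split()
    # Assumption: header lines start with the token "CDX"; data lines have >= 4 fields.
    if len(parts) < 4 or parts[0] == "CDX":
        return None
    # Assumption: the last two fields are the offset and the archive filename.
    offset = int(parts[-2]) if parts[-2].isdigit() else None
    return parts[0], parts[1], parts[-1], offset


def build_index(lines):
    """Map each URL key to the list of its captures; input order does not matter."""
    index = defaultdict(list)
    for line in lines:
        record = parse_cdx_line(line)
        if record is not None:
            urlkey, timestamp, filename, offset = record
            index[urlkey].append((timestamp, filename, offset))
    return index


def lookup(index, urlkey):
    """Return the captures for a URL key, ordered by timestamp."""
    return sorted(index.get(urlkey, []), key=lambda capture: capture[0])
```

Calling `lookup(build_index(SAMPLE_LINES), "uk,co,example)/")` returns the two captures for that key in timestamp order, each naming the archive file and offset where the response would be found. For the full yearly files, sorting each file on disk first and then searching it would be a more realistic approach than holding everything in memory.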
Provider:
British Library
Created:
2017-08-30



