Website Classification Dataset - UK Selective Web Archive

Source: DataCite Commons. Updated: 2020-07-23. Indexed: 2025-04-09.
Download link:
https://bl.iro.bl.uk/work/4dcd0215-d8d1-4b95-862e-ed355860b737
Description:
The dataset comprises a manually curated selective archive produced by the UK Web Archive (UKWA), in which each site is classified into a two-tiered subject hierarchy. UKWA has made this manually generated classification information available as an open dataset in Tab Separated Values (TSV) format. Separately, in partnership with the Internet Archive and JISC, UKWA obtained access to the subset of the Internet Archive’s web collection that relates to the UK. The resulting JISC UK Web Domain Dataset (1996-2013) contains all of the resources from the Internet Archive that were hosted on domains ending in ‘.uk’, or that are required in order to render those UK pages. UKWA is particularly interested in whether high-level metadata like this can be used to train an automatic classification system, so that the manually generated dataset can help to partially automate the categorisation of UKWA’s larger archives. UKWA expects that an appropriate classifier might require more information about each site in order to produce reliable results, and a future goal is to augment this dataset with further information. Options include making the titles of every page on each site available, and extracting a set of keywords that summarise each site via the full-text index. For more information: http://data.webarchive.org.uk/opendata/ukwa.ds.1/classification/
Provider:
British Library
Created:
2014-08-19
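
Example:
The description says the classification is distributed as TSV with a two-tiered subject hierarchy. The listing does not give the column layout, so the following is only a minimal sketch of how such a file might be read and summarised per tier. The column names ("Primary Category", "Secondary Category", "Title", "URL") and the sample rows are hypothetical placeholders, not taken from the dataset, and should be adjusted to match the real header.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in an ASSUMED column layout. These rows are
# placeholders for illustration and are not real records from the dataset.
SAMPLE_TSV = (
    "Primary Category\tSecondary Category\tTitle\tURL\n"
    "Arts & Humanities\tArchitecture\tExample Site A\thttp://example.org/a\n"
    "Arts & Humanities\tHistory\tExample Site B\thttp://example.org/b\n"
    "Science & Technology\tEngineering\tExample Site C\thttp://example.org/c\n"
)


def load_rows(handle):
    """Read tab-separated rows into a list of dicts keyed by the header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_by_tier(rows,
                  primary_col="Primary Category",
                  secondary_col="Secondary Category"):
    """Count sites per top-level subject and per (top, second) tier pair.

    The column names are assumptions; pass the real header names instead.
    """
    primary = Counter(row[primary_col] for row in rows)
    pairs = Counter((row[primary_col], row[secondary_col]) for row in rows)
    return primary, pairs


# For a downloaded file, replace the StringIO with something like
# open(path, newline="", encoding="utf-8"); the encoding is also an assumption.
rows = load_rows(io.StringIO(SAMPLE_TSV))
primary_counts, pair_counts = count_by_tier(rows)

for subject, n in primary_counts.most_common():
    print(subject, n)
```

Per-tier counts like these are a common first check before training a classifier, since sparsely populated second-tier subjects may need to be merged or excluded.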