
Materials for 2d representation of the HathiTrust Library

Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/1477017
Description:
Materials to create the LargeVis visualization online at http://creatingdata.us/datasets/hathi-features/, described in Benjamin Schmidt, "Stable random projection: lightweight, general-purpose dimensionality reduction for digitized libraries," Journal of Cultural Analytics, October 3, 2018. The deposit contains two items.

First, `hathi_pca.bin`: a binary file with 100-dimensional representations of the complete HathiTrust Extended Features set. These began as 1280-dimensional SRP features and were reduced to 100 dimensions using a PCA transformation matrix derived from a random sample of the full 13-million-book set. Vectors were normalized to unit length before PCA but not afterwards, so in general their length gives some sense of how much information was lost in the PCA representation. The file can be read using the code at https://github.com/bmschmidt/pySRP, or with anything that reads word2vec-formatted vectors. It includes HathiTrust identifiers.

Second, `hathi.tsv.gz`: a row-oriented file containing a variety of metadata fields for each book, including (as 'x' and 'y') the coordinates of a 2-d LargeVis visualization. This is the immediate input to the visualization at http://creatingdata.us/datasets/hathi-features/. Columns should be relatively straightforward; they are derived from the HathiTrust MARC records, which can be accessed through Hathi's public API. Classification codes ('lc1') use the Library of Congress classification; they represent the subclass (generally two characters, though it can be one or three). The first character alone represents the LC class and can be useful for coloring high-level overviews.

The two files can be merged on the HathiTrust identifier present in both.
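As a rough illustration of how the two files fit together, the sketch below reads word2vec-style binary vectors, computes each vector's length, takes the first character of 'lc1' as the LC class, and joins the result to the TSV rows by identifier. It is not the author's code and makes several assumptions: the binary file follows the common word2vec layout (an ASCII header with count and dimensions, then for each record an identifier, a space, and little-endian float32 values), the identifier column in the TSV is named `id` (hypothetical; check the real header), and the identifiers and values in the demo are made up. Only the column names 'x', 'y', and 'lc1' come from the description above.

```python
import csv
import io
import math
import struct


def read_word2vec_binary(stream):
    """Read a word2vec-style binary stream into {identifier: [floats]}."""
    # Header line: "<number of vectors> <dimensions>\n" in ASCII.
    count, dims = (int(n) for n in stream.readline().split())
    vectors = {}
    for _ in range(count):
        # Each record: identifier, one space, then `dims` float32 values.
        key = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended before all vectors were read")
            if ch == b" ":
                break
            if ch != b"\n":  # some writers put a newline after each record
                key.extend(ch)
        raw = stream.read(4 * dims)
        # Assumption: little-endian float32, as written on most machines.
        vectors[key.decode("utf-8")] = list(struct.unpack(f"<{dims}f", raw))
    return vectors


def vector_length(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))


def merge(vectors, tsv_stream, id_column="id"):
    """Join TSV metadata rows to vectors on the shared identifier."""
    merged = []
    for row in csv.DictReader(tsv_stream, delimiter="\t"):
        vec = vectors.get(row[id_column])
        if vec is None:
            continue  # no vector for this identifier
        merged.append({
            "id": row[id_column],
            "x": float(row["x"]),
            "y": float(row["y"]),
            "lc_class": row["lc1"][:1],  # first character = LC class
            "length": vector_length(vec),
        })
    return merged


# Tiny made-up stand-ins for the real files (3 dimensions instead of 100).
demo_bin = io.BytesIO(
    b"2 3\n"
    + b"mdp.001 " + struct.pack("<3f", 0.6, 0.0, 0.0) + b"\n"
    + b"uc1.002 " + struct.pack("<3f", 0.0, 0.3, 0.4) + b"\n"
)
demo_tsv = io.StringIO(
    "id\tx\ty\tlc1\n"
    "mdp.001\t1.5\t-2.0\tPS\n"
    "uc1.002\t0.25\t3.0\tQA\n"
)

rows = merge(read_word2vec_binary(demo_bin), demo_tsv)
for row in rows:
    print(row)
```

For the real data, the same functions could presumably be given `open("hathi_pca.bin", "rb")` and `gzip.open("hathi.tsv.gz", "rt", encoding="utf-8", newline="")`. With 13 million rows, a dictionary of Python lists will use a lot of memory, and rows with empty 'x' or 'y' values would need handling that this sketch omits.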
Created:
2020-01-24