institutional/institutional-books-1.0-metadata
Source: Hugging Face | Updated: 2026-04-24 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/institutional/institutional-books-1.0-metadata
Description:
Institutional Books is a growing corpus of public domain books. Version 1.0 of the dataset contains 983,004 public domain books, published primarily in the 19th and 20th centuries. The dataset provides extensive volume-level metadata, including both original and generated components. It has also been refined through collection-level deduplication, analysis of OCR artifacts and text, and post-processing of the OCR text. The dataset originates from the digitization of books through Harvard Library's participation in the Google Books project.
Provider:
institutional



