Page-Level Genre Metadata for English-Language Volumes in HathiTrust, 1700-1922
Source: figshare.com. Updated 2023-05-30; indexed 2025-03-26.
Download link:
https://figshare.com/articles/dataset/Page_Level_Genre_Metadata_for_English_Language_Volumes_in_HathiTrust_1700_1922/1279201/1
Description:
Page-by-page genre predictions for 854,476 English-language volumes printed between 1700 and 1922, keyed to the texts in HathiTrust Digital Library. This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies.
The genre predictions were produced by an ensemble of regularized logistic classifiers, and are intended to support research that explores broad trends in literary history. Since volumes usually contain multiple genres, page-level metadata is necessary to create machine-readable collections in a particular genre.
Only very broad categories are discriminated here (fiction, poetry, drama, nonfiction prose, paratext). Overall average accuracy is 93.6%, but confidence metrics are included that allow researchers to trade recall for enhanced precision. For instance, the filtered subsets of fiction, poetry, and drama (included here as fiction.tar.gz, etc.) have higher than 97% precision.
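The trade-off between recall and precision described above can be illustrated with a minimal sketch. It assumes each prediction carries a single confidence score and a known right/wrong label; the function name, the toy data, and the threshold values are invented for illustration and are not taken from the dataset or the project's code. Recall here is measured only relative to the correct predictions in the unfiltered set.

```python
def precision_recall_at(threshold, scored):
    """Return (precision, recall) after keeping predictions at or above threshold.

    scored: list of (confidence, is_correct) pairs. Hypothetical structure;
    the real dataset's confidence metrics may be organized differently.
    """
    kept = [ok for conf, ok in scored if conf >= threshold]
    total_correct = sum(1 for _, ok in scored if ok)
    # Share of kept predictions that are correct.
    precision = sum(kept) / len(kept) if kept else 1.0
    # Share of all correct predictions that survive the filter.
    recall = sum(kept) / total_correct if total_correct else 0.0
    return precision, recall


# Toy data, invented for illustration only.
toy = [(0.99, True), (0.95, True), (0.90, True),
       (0.80, False), (0.70, True), (0.60, False)]

for t in (0.0, 0.85):
    p, r = precision_recall_at(t, toy)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold discards the two wrong predictions but also loses one correct one, which is the kind of exchange the confidence metrics are meant to make possible.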
Predictions are stored as JSON objects in separate files, one per volume. The tar.gz files prefixed with "all" include all 854,476 volumes, divided by date. The tar.gz files named for genres contain subsets of volumes that have been filtered to achieve greater than 97% precision in that particular genre: 18,111 volumes containing drama, 102,349 volumes containing fiction, and 61,286 volumes containing poetry. These datasets were filtered both by confidence metrics created by a logistic model and by manual editing. Ringers.csv lists the volumes that we had to remove manually; scholars who select their own datasets from the larger collection (the files beginning with "all") may also want to consider filtering out these tricky cases.
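As a rough sketch of how one might walk through one of these archives while skipping the volumes listed in Ringers.csv, the following uses only the Python standard library. Several details are assumptions rather than facts about the dataset: that the per-volume files end in ".json", that the file's base name is the volume identifier, and that the identifier sits in the first column of Ringers.csv. The helper names are hypothetical. The interim project report remains the authority on the actual data format.

```python
import csv
import json
import os
import tarfile


def iter_volume_json(archive_path, suffix=".json"):
    """Yield (member_name, parsed_json) for each matching file in a .tar.gz.

    The ".json" suffix is an assumption about how the files are named.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            with tar.extractfile(member) as handle:
                yield member.name, json.load(handle)


def volume_id_from_name(member_name, suffix=".json"):
    """Derive a volume ID from the file name (assumed convention)."""
    base = os.path.basename(member_name)
    return base[:-len(suffix)] if base.endswith(suffix) else base


def load_id_set(csv_path, column=0):
    """Collect the values of one CSV column (assumed to hold volume IDs)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[column].strip()
                for row in csv.reader(handle) if len(row) > column}


def iter_filtered(archive_path, ringers_path):
    """Yield (volume_id, parsed_json), skipping IDs found in the ringers list."""
    excluded = load_id_set(ringers_path)
    for name, record in iter_volume_json(archive_path):
        vol_id = volume_id_from_name(name)
        if vol_id not in excluded:
            yield vol_id, record
```

A call such as `iter_filtered("fiction.tar.gz", "Ringers.csv")` would then stream the remaining volumes one at a time without unpacking the whole archive to disk, provided the naming assumptions above hold.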
Accompanying meta.csv files provide summary volume-level metadata for each collection. For full details of methods and data format, see the interim project report (http://dx.doi.org/10.6084/m9.figshare.1281251). For software and training data used in the project, see the repository (https://github.com/tedunderwood/genre).
Provider:
figshare.com



