HTRC Word Frequencies in English-Language Literature, 1700-1922

SSH Open Marketplace · Updated 2021-07-22 · Indexed 2024-08-03

Download link:

https://marketplace.sshopencloud.eu/dataset/OByQui

Description:

Many of the questions scholars want to ask about large collections of text can be posed using simplified representations – for instance, a list of the words in each volume, together with their frequencies. This dataset represents a first attempt to provide that information for English-language fiction, drama, and poetry published between 1700 and 1922, and contained in the HathiTrust Digital Library. The project combines two sources of information. The word counts themselves come from the HathiTrust Research Center (HTRC), which has tabulated them at the page level in 4.8 million public-domain volumes. Information about genre comes from a parallel project led by Ted Underwood, and supported by the National Endowment for the Humanities and the American Council of Learned Societies. This project applied machine learning to recognize genre at the page level in 854,476 English-language volumes. Mapping genre at the page level is important because genres are almost always mixed within volumes. Volumes of poetry can have long nonfiction introductions; volumes of fiction can be followed by many pages of publishers' advertisements. Fortunately, text categories of this broad kind (fiction/nonfiction/poetry/drama/paratext) can be identified fairly accurately by statistical models.
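As a rough illustration of how page-level word counts and page-level genre labels might be combined into a per-volume frequency list, here is a minimal Python sketch. The record layout (a genre label paired with a `Counter` of tokens), the sample values, and the function name `volume_counts` are assumptions made for this example. They do not reflect the actual file format distributed by HTRC.

```python
from collections import Counter

# Hypothetical page-level records for one volume: (genre label, token counts).
# This layout and these values are illustrative assumptions, not the HTRC format.
pages = [
    ("paratext", Counter({"published": 2, "london": 1})),
    ("fiction", Counter({"the": 10, "whale": 3})),
    ("fiction", Counter({"the": 7, "sea": 2})),
    ("nonfiction", Counter({"the": 4, "preface": 1})),
]


def volume_counts(pages, genre):
    """Sum word counts over only those pages labeled with the given genre."""
    total = Counter()
    for label, counts in pages:
        if label == genre:
            total.update(counts)
    return total


# Words from the nonfiction preface and the paratext pages are excluded.
fiction_counts = volume_counts(pages, "fiction")
print(fiction_counts.most_common(3))
```

Filtering by page label before summing is what keeps a nonfiction introduction or a run of publishers' advertisements from contaminating the fiction word counts for a volume.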

Created:

2021-07-22
