Jane Austen's Novel Corpus analysis using LancsBox X in clusters ranging from 1 to 17 words

Indexed by NIAID Data Ecosystem, 2026-05-02

Download link:

https://doi.org/10.7910/DVN/EXUWCK

Description:

This corpus analysis provides data on six of Jane Austen’s novels for words and word clusters ranging from one to seventeen words. For each word or cluster, it reports: the frequency of appearance within the corpus; the Relative Frequency; the Average Reduced Frequency (that is, the frequency discounting near repetitions); the Range (that is, the number of novels in which the word or cluster appears); the Range Percentage; the Coefficient of Variation; Juilland’s D, which measures how evenly the word or cluster is spread across the corpus; and the Deviation of Proportions, which measures how far its distribution departs from a proportional one.

The corpus analyzed here comprises six of Jane Austen’s novels: Sense and Sensibility (1811), Pride and Prejudice (1813), Mansfield Park (1814), Emma (1815), Persuasion (1817), and Northanger Abbey (1817). The dataset intentionally excludes both Sanditon (unfinished, 1817) and the Juvenilia (1787-1793), the former because the author did not finish it and the latter because it does not constitute a novel. All of the novels were taken from Project Gutenberg. The software used for this analysis is LancsBox X, developed by Lancaster University as a tool for the analysis of linguistic corpora. Apart from the correction of a few typographic errors introduced by the software, the data offered here is the raw output of the software, so that researchers can use it for their own purposes without any type of bias.

This data can be used, for instance, to understand how the novels convey the notion of self, or to find in how many of the novels a particular cluster of words appears twice or more. Examples include the appearance of the same poem in both Northanger Abbey and Emma, and the fact that many of the repetitions between novels involve specifically Northanger Abbey and the rest of the novels. This offers a perspective in which Northanger Abbey (which, despite its late publication, was written in 1803, according to the author's note within the novel) worked partly as a prototype for what was to come, at least linguistically. In short, this data provides a linguistic analysis of a literary corpus that can easily be applied to different lines of research.

Note that, in several cases, separate entries refer to the same cluster shifted one node backward or forward. These entries have been kept in case the data is used to study the performance of the software, and because researchers can easily filter the information they require. Furthermore, the word "chapter" has been removed from the corpus, except where it appears in the body of the novels. The word "end" has not been treated in the same way, since some of the novels end with the word Finis (3). Researchers should note that, of the 277 instances in which the word "end" appears, 3 are the final words of Emma, Persuasion, and Mansfield Park.

V2. A second document has been added that presents an analysis of the data, which will be used in a forthcoming article. This document explores two sets of words: those related to identity and those related to gender. It also explores the relevance of particular clusters, such as the cluster "I am sure I do not know", which is used only by women within the novels, and two other clusters that seemed interesting in terms of gender or because of their strangeness (such as the courtship riddle in Emma).
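
As a rough illustration of the dispersion measures listed in the description, the following Python sketch computes Range, Range Percentage, Coefficient of Variation, Juilland’s D, and Deviation of Proportions from per-novel counts. It uses the commonly cited textbook formulas (D = 1 - CV / sqrt(n - 1); DP as half the sum of absolute differences between observed and expected proportions). Whether LancsBox X computes its values in exactly this way is an assumption, as are the function name `dispersion_stats`, the use of the population standard deviation, and the per-million normalisation base. The counts in the usage lines are invented, not taken from the dataset, and the Average Reduced Frequency is not covered.

```python
import math


def dispersion_stats(freqs, sizes, per=1_000_000):
    """Dispersion measures for one word or cluster (hypothetical helper).

    freqs: absolute frequency of the item in each novel.
    sizes: token count of each novel, in the same order.
    per:   normalisation base for relative frequency (an assumption).
    """
    if len(freqs) != len(sizes) or len(freqs) < 2:
        raise ValueError("need matching lists covering at least two texts")
    total_freq, total_size = sum(freqs), sum(sizes)
    if total_freq == 0:
        raise ValueError("the item does not occur in the corpus")

    n = len(freqs)
    # Relative frequency of the item within each novel.
    rel = [f / s * per for f, s in zip(freqs, sizes)]
    mean = sum(rel) / n
    # Population standard deviation (assumed; a sample SD would differ).
    sd = math.sqrt(sum((r - mean) ** 2 for r in rel) / n)
    cv = sd / mean
    texts_with_item = sum(1 for f in freqs if f > 0)
    # DP compares each novel's share of the occurrences with its share of the corpus.
    dp = 0.5 * sum(
        abs(f / total_freq - s / total_size) for f, s in zip(freqs, sizes)
    )
    return {
        "relative_frequency": total_freq / total_size * per,
        "range": texts_with_item,
        "range_pct": 100 * texts_with_item / n,
        "cv": cv,
        "juilland_d": 1 - cv / math.sqrt(n - 1),
        "dp": dp,
    }


# Invented counts for six equal-sized texts: an evenly spread item
# and an item confined to a single text.
even = dispersion_stats([5] * 6, [1000] * 6)
single = dispersion_stats([6, 0, 0, 0, 0, 0], [1000] * 6)
print(even["range"], even["juilland_d"], even["dp"])
print(single["range"], single["juilland_d"], single["dp"])
```

Under these formulas, an item spread evenly over six equal-sized texts has D = 1 and DP = 0, whereas an item found in only one of them has D = 0 and DP = 5/6, which is the sense in which D measures evenness and DP measures unevenness.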

Created:

2025-07-22