contextlab/fitzgerald-corpus
Source: Hugging Face. Updated: 2025-10-28. Indexed: 2025-11-01.
Download link:
https://hf-mirror.com/datasets/contextlab/fitzgerald-corpus
Description:
This dataset contains works by F. Scott Fitzgerald, cleaned and preprocessed for computational stylometry research. It comprises 8 books totaling approximately 592,393 words, stored as lowercase plain-text files. Preprocessing removes headers, footers, chapter headings, and other non-narrative text, and normalizes the character encoding. The dataset is intended for stylometry research, language model training, literary analysis, and historical NLP. Its limitations should be noted: the language is historical, the text is lowercase only, and only public domain works are included.
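The preprocessing described above can be illustrated with a minimal sketch. This is not the pipeline contextlab used; the heading pattern, the quote mapping, and the function names `normalize_text` and `word_count` are assumptions made for illustration, and header/footer removal is left out because it depends on the source files.

```python
import re
import unicodedata

# Hypothetical pattern: a line consisting only of "Chapter" plus a Roman or
# Arabic numeral, e.g. "CHAPTER IV" or "Chapter 3". The real corpus may have
# used different rules.
CHAPTER_HEADING = re.compile(r"^\s*chapter\s+([ivxlcdm]+|\d+)\.?\s*$", re.IGNORECASE)

# Assumed encoding normalization: map curly quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize_text(raw: str) -> str:
    """Normalize encoding, drop chapter-heading lines, and lowercase the text."""
    # NFKC folds compatibility characters (such as ligatures) into plain forms.
    text = unicodedata.normalize("NFKC", raw).translate(QUOTE_MAP)
    kept = [line for line in text.splitlines() if not CHAPTER_HEADING.match(line)]
    return "\n".join(kept).lower()


def word_count(text: str) -> int:
    """Count whitespace-separated tokens, one simple definition of a 'word'."""
    return len(text.split())
```

Applied to a short sample, `normalize_text` drops the heading lines and returns lowercase text, and `word_count` then gives a whitespace-based count. The dataset's own word total may rest on a different tokenization.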
Provided by:
contextlab



