Psycholinguistic LIWC and n-gram counts in a corpus of 1145 English and Dutch novels
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/tmp32v54ss
Description:
This dataset consists of CSV files with word counts in several corpora:
- 694 English-language novels by male and female authors, classified by the authors' sexual orientation (heterosexual, bisexual, homosexual)
- 401 bestselling Dutch-language novels
- 50 novels nominated for Dutch literary prizes
Each corpus comes with:
- LIWC counts; this file also includes the available metadata for each novel.
  The English data was created with LIWC 2015. The Dutch data was created with
  the validated translation of LIWC 2001.
- Word counts (unigrams) and bigram counts per novel.
  All text has been converted to lowercase.
  Contractions are tokenized into separate tokens, e.g., can't => ca n't
  Two restrictions are applied:
  - only unigrams or bigrams that occur in at least 10 texts are retained
  - only the 100k most frequent are retained
- Overall word counts and bigram counts; i.e., the sum across all novels.
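
The lowercasing and contraction handling described above can be illustrated with a minimal Python sketch. This is not the countngrams.py script. The function names `tokenize` and `count_ngrams` are hypothetical, only the "n't" case given in the description is handled, and treating punctuation marks as separate tokens is an assumption.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split "n't" contractions (can't -> ca n't)."""
    text = text.lower()
    # Detach "n't" from the preceding word: "can't" -> "ca n't".
    text = re.sub(r"(?<=\w)n't\b", " n't", text)
    # Assumption: punctuation marks become separate tokens.
    return re.findall(r"n't\b|\w+(?:'\w+)?|[^\w\s]", text)


def count_ngrams(tokens):
    """Return (unigram, bigram) Counters for one tokenized novel."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


tokens = tokenize("Can't stop")
print(tokens)  # ['ca', "n't", 'stop']
unigrams, bigrams = count_ngrams(tokens)
```

Bigrams are represented here as tuples of two adjacent tokens; how the CSV files encode them may differ.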
All files are encoded in UTF-8.
The word counts were extracted with the countngrams.py script.
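
As an illustration of the two restrictions on the per-novel counts, the following Python sketch keeps only n-grams that occur in a minimum number of texts and then caps the vocabulary at the most frequent entries. It is a sketch under stated assumptions, not the logic of countngrams.py: the function name `filter_ngrams` and its parameters are hypothetical, "most frequent" is taken to mean highest total count across all novels, and the order in which the two restrictions are applied is a guess.

```python
from collections import Counter


def filter_ngrams(per_novel, min_texts=10, max_types=100_000):
    """Apply both restrictions to a {novel_id: Counter} mapping.

    Returns (filtered per-novel counts, overall counts of retained n-grams).
    """
    doc_freq = Counter()  # number of novels each n-gram occurs in
    totals = Counter()    # summed count across all novels
    for counts in per_novel.values():
        doc_freq.update(counts.keys())
        totals.update(counts)

    # Restriction 1: the n-gram must occur in at least `min_texts` novels.
    eligible = Counter(
        {ng: n for ng, n in totals.items() if doc_freq[ng] >= min_texts}
    )
    # Restriction 2 (assumed to come second): keep the most frequent ones.
    kept = dict(eligible.most_common(max_types))

    filtered = {
        novel: Counter({ng: n for ng, n in counts.items() if ng in kept})
        for novel, counts in per_novel.items()
    }
    return filtered, kept


# Toy example with a lowered threshold so the effect is visible.
toy = {
    "novel_a": Counter({"the": 3, "rare": 1}),
    "novel_b": Counter({"the": 2}),
}
filtered, overall = filter_ngrams(toy, min_texts=2, max_types=10)
print(overall)  # {'the': 5}
```

The second return value corresponds to summing counts across all novels, restricted here to the retained n-grams; whether the published overall files are computed before or after filtering is not stated in the description.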
Created:
2020-11-04



