Data for: Psycholinguistic dataset on language use in 1145 novels published in English and Dutch

Mendeley Data. Updated 2024-06-25; indexed 2024-06-27.

Download link:

https://data.mendeley.com/datasets/x3m2gjkhx5

Description:

LIWC and n-gram counts of English and Dutch novels
==================================================

This dataset consists of CSV files with word counts in several corpora:

- 694 English language novels by authors of different genders and sexual orientations
- 401 bestselling Dutch language novels
- 50 novels nominated for Dutch literary prizes

Each corpus comes with:

- LIWC counts; this file also includes the available metadata for each novel.
  The English data was created with LIWC 2015.
  The Dutch data was created with the validated translation of LIWC 2001.
- Word counts (unigrams) and bigram counts per novel.
  All text has been converted to lowercase.
  Contractions are tokenized into separate tokens, e.g., can't => ca n't
  Two restrictions are applied:
  - only unigrams or bigrams that occur in at least 10 texts are retained
  - only the 100k most frequent are retained
- Overall word counts and bigram counts; i.e., the sum across all novels.

All files are encoded in UTF-8.
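The counting and filtering rules described above can be illustrated with a small Python sketch. This is not the authors' pipeline: the function names `tokenize`, `ngram_counts`, and `filter_ngrams` are hypothetical, the tokenizer is a simplified stand-in that only handles the `n't` case, and two points the description leaves open are assumed here: that the 100k cutoff is applied after the 10-text threshold, and that the overall counts cover only the retained n-grams.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into tokens (simplified, hypothetical)."""
    text = text.lower()
    # Detach the "n't" clitic from its host word: "can't" -> "ca n't".
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Tokens are "n't", runs of word characters, or single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    """Count n-grams (as tuples) in one novel; n=1 for unigrams, n=2 for bigrams."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def filter_ngrams(per_novel, min_texts=10, top_k=100_000):
    """Apply both restrictions to a list of per-novel Counters.

    Assumption: the document-frequency threshold is applied first, then the
    top_k most frequent survivors are kept. Ties at the cutoff are broken
    arbitrarily.
    """
    doc_freq = Counter()  # number of novels each n-gram occurs in
    total = Counter()     # summed frequency across all novels
    for counts in per_novel:
        doc_freq.update(counts.keys())
        total.update(counts)

    eligible = {g: c for g, c in total.items() if doc_freq[g] >= min_texts}
    kept = set(sorted(eligible, key=eligible.get, reverse=True)[:top_k])

    filtered = [
        Counter({g: c for g, c in counts.items() if g in kept})
        for counts in per_novel
    ]
    overall = Counter({g: total[g] for g in kept})
    return filtered, overall


# Toy data with thresholds scaled down so the effect is visible.
novels = ["I can't go", "I can go", "You can't"]
per_novel = [ngram_counts(tokenize(t), 1) for t in novels]
filtered, overall = filter_ngrams(per_novel, min_texts=2, top_k=10)
```

In this toy run, `can` and `you` each occur in only one text, so they are dropped from both the per-novel counts and the overall counts, while `ca` and `n't` survive. With the dataset's actual settings the defaults `min_texts=10` and `top_k=100_000` would apply.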

Created:

2024-01-23