
Psycholinguistic LIWC and n-gram counts in a corpus of 1145 English and Dutch novels

Source: Mendeley Data. Updated: 2024-03-27. Indexed: 2024-06-26.
Download link:
https://data.mendeley.com/datasets/tmp32v54ss
Description:
This dataset consists of CSV files with word counts in several corpora:
- 694 English-language novels from male and female authors, classified by the authors' sexual orientation (heterosexual, bisexual, homosexual)
- 401 bestselling Dutch-language novels
- 50 novels nominated for Dutch literary prizes

Each corpus comes with:
- LIWC counts; this file also includes the available metadata for each novel. The English data was created with LIWC 2015. The Dutch data was created with the validated translation of LIWC 2001.
- Word counts (unigrams) and bigram counts per novel. All text has been converted to lowercase. Contractions are tokenized into separate tokens, e.g., can't => ca n't. Two restrictions are applied:
  - only unigrams or bigrams that occur in at least 10 texts are retained
  - only the 100k most frequent are retained
- Overall word counts and bigram counts, i.e., the sum across all novels.

All files are encoded in UTF-8. The word counts were extracted with the countngrams.py script.
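
The counting rules described above (lowercasing, splitting contractions, keeping only n-grams found in at least 10 texts, then keeping the 100k most frequent) can be illustrated with a minimal Python sketch. This is not the dataset's countngrams.py script: the names `tokenize`, `ngrams`, and `count_corpus`, the regular expression, the order in which the two restrictions are applied, and the use of total corpus frequency for the "most frequent" ranking are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenizer: splits "n't" off contractions (can't -> ca n't) and
# separates punctuation. The real script may tokenize differently.
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return space-joined n-grams (n=1 for unigrams, n=2 for bigrams)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_corpus(texts, n=1, min_docs=10, max_types=100_000):
    """Count n-grams per text and overall, applying both restrictions.

    Assumption: the document-frequency filter runs first, then the
    most-frequent cutoff is taken by total count across all texts.
    """
    per_text = [Counter(ngrams(tokenize(t), n)) for t in texts]

    doc_freq = Counter()  # number of texts each n-gram occurs in
    total = Counter()     # summed count across all texts
    for counts in per_text:
        doc_freq.update(counts.keys())
        total.update(counts)

    eligible = Counter({g: c for g, c in total.items() if doc_freq[g] >= min_docs})
    kept = {g for g, _ in eligible.most_common(max_types)}

    filtered = [{g: c for g, c in counts.items() if g in kept} for counts in per_text]
    overall = {g: total[g] for g in kept}
    return filtered, overall
```

With a small `min_docs`, the same function can be tried on a handful of strings; the per-text dictionaries correspond to the per-novel count files and `overall` to the summed counts.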

Created:
2024-01-23