
ARENA_Hierarchical Organization of Distributed Semantic Knowledge in the Human Language System_Language Study Pt. 1: stimulus set

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12742973
Description:
P4_WP1_01 Language Study Pt. 1: stimulus set

Folder structure:

Raw_input
The raw_input folder contains the text of every chapter of the book to be used in the experiment (Moonwalk mit Einstein: Wie aus einem vergeßlichen Mann ein Gedächtnis-Champion wurde, by Joshua Foer, translated by Ulla Rahn-Huber. Published by Riemann Verlag (28 Mar. 2011)). The folder also contains the pretrained vector model for German words (from https://fasttext.cc/docs/en/crawl-vectors.html) and ratings for concreteness and word frequency (all references are included in the scripts). In addition, it holds a translated version of the THINGS labels, used to intersect the single words with the THINGS dataset (Hebart MN, Dickter AH, Kidder A, Kwok WY, Corriveau A, et al. (2019) THINGS: A database of 1,854 object concepts and more than 26,000 naturalistic object images. PLOS ONE 14(10): e0223792. https://doi.org/10.1371/journal.pone.0223792).

Scripts
This folder contains 6 scripts for sampling single words from the raw text. The scripts are numbered according to the intended order of use.
- 1_from_text_to_df.py: extracts only nouns and verbs from the raw text, together with their relative word frequency, concreteness, lemma form, and number of characters.
- 2_cluster_words.py: runs a cluster analysis of the word vectors so that the semantic space is sampled as broadly as possible. Loosely based on Pereira, F., Lou, B., Pritchett, B. et al. Toward a universal decoder of linguistic meaning from brain activation. Nat Commun 9, 963 (2018). https://doi.org/10.1038/s41467-018-03068-4.
- 3_compute_orthographic_density.py: adds OND20 information for both word forms and lemma forms.
- 4_syntactic_valency.py: applies only to verbs. It counts the number of arguments a verb needs to saturate its syntactic valency (e.g., subject + object).
- 5_sampling_nouns.py: samples nouns while preserving the distributions of all variables. The number of characters for every word is kept under 10. Additionally, the extreme quantiles of concreteness are matched on all other variables, so that the more concrete and the more abstract words in the set remain matched on the other ratings.
- 6_sampling_verbs.py: same as above, but for verbs.

Stimuli
This folder contains the pool of words before sampling. Note that some words have been manually excluded for several reasons, e.g. words parsed wrongly in their lemma form, offensive words, and words coming from other languages.
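The description does not define OND20. As an illustration only, the sketch below assumes it is analogous to the OLD20 measure, i.e. the mean Levenshtein distance from a word to its 20 closest entries in a lexicon. The function names, the toy lexicon, and the neighbor count used in the example are hypothetical and are not taken from 3_compute_orthographic_density.py.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def mean_neighbor_distance(word: str, lexicon: Sequence[str],
                           n_neighbors: int = 20) -> float:
    """Mean edit distance from `word` to its n closest other lexicon entries.

    Assumption: this mirrors an OLD20-style measure; the dataset's own
    OND20 computation may differ.
    """
    distances = sorted(levenshtein(word, other)
                       for other in lexicon if other != word)
    closest = distances[:n_neighbors]
    if not closest:
        raise ValueError("lexicon has no entries other than the word itself")
    return sum(closest) / len(closest)


# Toy lexicon (hypothetical); a real run would use a large German word list.
toy_lexicon = ["Haus", "Maus", "Laus", "Hals", "Baum"]
density = mean_neighbor_distance("Haus", toy_lexicon, n_neighbors=4)
```

The same function could be applied once to the word forms and once to the lemma forms, which is how the script is described as reporting the measure.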
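The matching of the extreme concreteness quantiles described for the sampling scripts can be pictured with a small check like the one below. It is a minimal sketch under stated assumptions, not the logic of 5_sampling_nouns.py: the column names, the use of quartiles, and the toy ratings are all hypothetical.

```python
from statistics import mean, quantiles


def extreme_groups(rows, key="concreteness", max_chars=10, n_quantiles=4):
    """Return the lowest and highest quantile groups of `key`.

    Words with `max_chars` or more characters are dropped first, following
    the "kept under 10" rule in the description. Quartiles are an assumption.
    """
    kept = [r for r in rows if len(r["word"]) < max_chars]
    cuts = quantiles([r[key] for r in kept], n=n_quantiles)
    low = [r for r in kept if r[key] <= cuts[0]]
    high = [r for r in kept if r[key] >= cuts[-1]]
    return low, high


def mean_gap(low, high, variable):
    """Absolute difference between the group means of one control variable."""
    return abs(mean(r[variable] for r in low) - mean(r[variable] for r in high))


# Toy pool with made-up ratings, for illustration only.
pool = [
    {"word": "Idee", "concreteness": 1.0, "frequency": 3.0},
    {"word": "Mut", "concreteness": 2.0, "frequency": 5.0},
    {"word": "Plan", "concreteness": 3.0, "frequency": 4.5},
    {"word": "Zeit", "concreteness": 4.0, "frequency": 6.0},
    {"word": "Weg", "concreteness": 5.0, "frequency": 5.5},
    {"word": "Raum", "concreteness": 6.0, "frequency": 4.0},
    {"word": "Tisch", "concreteness": 7.0, "frequency": 4.0},
    {"word": "Hund", "concreteness": 8.0, "frequency": 4.0},
    {"word": "Gedächtniskunst", "concreteness": 2.5, "frequency": 1.0},
]
abstract, concrete = extreme_groups(pool)
frequency_gap = mean_gap(abstract, concrete, "frequency")
```

A small gap on every control variable would indicate that the most abstract and most concrete words differ mainly in concreteness. A sampler could redraw candidate sets until all such gaps fall below a chosen tolerance.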
Created:
2024-07-15