ARENA_Hierarchical Organization of Distributed Semantic Knowledge in the Human Language System_Language Study Pt. 1: stimulus set
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://zenodo.org/record/12742973
Resource description:
P4_WP1_01 Language Study Pt. 1: stimulus set
Folder structure:
Raw_input
The raw_input folder contains the text of every chapter of the book to be used in the experiment (Moonwalk mit Einstein: Wie aus einem vergeßlichen Mann ein Gedächtnis-Champion wurde, by Joshua Foer, translated by Ulla Rahn-Huber; Riemann Verlag, 28 Mar. 2011). The folder also contains the pretrained vector model for German words (from https://fasttext.cc/docs/en/crawl-vectors.html) and ratings for concreteness and word frequency (all references are included in the scripts). Finally, it includes a translated version of the THINGS labels, used to intersect the single words with the THINGS dataset (Hebart MN, Dickter AH, Kidder A, Kwok WY, Corriveau A, et al. (2019) THINGS: A database of 1,854 object concepts and more than 26,000 naturalistic object images. PLOS ONE 14(10): e0223792. https://doi.org/10.1371/journal.pone.0223792).
Scripts
This folder contains 6 scripts for sampling single words from the raw text. The scripts are numbered according to the intended order of use.
- 1_from_text_to_df.py: extracts only nouns and verbs from the raw text, together with their relative word frequency, concreteness, lemma form, and number of characters.
- 2_cluster_words.py: cluster analysis of word vectors to sample the semantic space as broadly as possible. Loosely based on Pereira, F., Lou, B., Pritchett, B. et al. Toward a universal decoder of linguistic meaning from brain activation. Nat Commun 9, 963 (2018). https://doi.org/10.1038/s41467-018-03068-4.
- 3_compute_orthographic_density.py: add information for OND20 for both word forms and lemma forms.
- 4_syntactic_valency.py: this applies only to verbs. It counts the number of arguments necessary for a verb to saturate its syntactic valency (e.g., subject + object).
- 5_sampling_nouns.py: samples nouns while preserving the distributions of all variables. The number of characters for every word is kept under 10. Additionally, the extreme quantiles of concreteness are matched on all other variables, so that the more concrete and the more abstract words in the set are still matched along the other ratings.
- 6_sampling_verbs.py: same as above but for verbs.
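The sampling idea in scripts 5 and 6 can be illustrated with a minimal sketch. This is not the authors' code: it assumes a hypothetical list of dictionaries with the keys "word" and "concreteness", uses made-up toy ratings, and stratifies on concreteness only, whereas the actual scripts are described as preserving the distributions of all variables. The function name stratified_sample and its parameters are illustrative.

```python
import random


def stratified_sample(words, n_bins=2, per_bin=2, max_chars=10, seed=0):
    """Draw an equal number of words from each concreteness quantile bin.

    `words` is a list of dicts with keys "word" and "concreteness"
    (a hypothetical layout, not the format used in the dataset).
    """
    # Keep only words with fewer than `max_chars` characters.
    pool = [w for w in words if len(w["word"]) < max_chars]
    # Sort by concreteness so that consecutive slices are quantile bins.
    pool.sort(key=lambda w: w["concreteness"])
    bins = [
        pool[i * len(pool) // n_bins:(i + 1) * len(pool) // n_bins]
        for i in range(n_bins)
    ]
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    sample = []
    for b in bins:
        sample.extend(rng.sample(b, min(per_bin, len(b))))
    return sample


# Toy pool with invented ratings, for illustration only.
pool = [
    {"word": "Haus", "concreteness": 4.9},
    {"word": "Baum", "concreteness": 4.8},
    {"word": "Tisch", "concreteness": 4.7},
    {"word": "Hund", "concreteness": 4.6},
    {"word": "Zeit", "concreteness": 2.0},
    {"word": "Idee", "concreteness": 1.8},
    {"word": "Sinn", "concreteness": 1.6},
    {"word": "Mut", "concreteness": 1.5},
    {"word": "Wissenschaft", "concreteness": 2.5},  # too long, filtered out
]
sample = stratified_sample(pool, n_bins=2, per_bin=2)
print([w["word"] for w in sample])
```

Drawing the same number of items from each quantile bin keeps abstract and concrete words equally represented in the sample. Matching the extreme quantiles on the other ratings, as described for script 5, would need an additional check that this sketch does not implement.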
Stimuli
This folder contains the pool of words before sampling. Note that some words were manually excluded for several reasons, e.g., wrongly parsed lemma forms, offensive words, or words from other languages.
Created:
2024-07-15



