The raw n-grams dataset for Rajeg et al.’s (2022) “The Spatial Construal of TIME in Indonesian: Evidence from Language and Gesture”
Source: DataCite Commons. Updated: 2024-09-30. Indexed: 2024-11-06.
Download link:
https://figshare.com/articles/dataset/The_raw_n-grams_dataset_for_Rajeg_et_al_s_2022_The_Spatial_Construal_of_TIME_in_Indonesian_Evidence_from_Language_and_Gesture_/27138921
Resource description:
<b>How to cite</b>
Rajeg, Gede Primahadi Wijaya (2024). The raw n-grams dataset for Rajeg et al.’s (2022) “The Spatial Construal of TIME in Indonesian: Evidence from Language and Gesture”. figshare. Dataset. https://doi.org/10.6084/m9.figshare.27138921

<b>Overview</b>
A dataset of non-tabulated (raw) n-grams (from 2-grams up to 5-grams) derived from a corpus file in the <i>Indonesian Leipzig Corpora Collection</i> (ILCC), namely “ind_newscrawl_2016_1M-sentences.txt”, which was the latest addition to the ILCC when the project that generated these n-grams started in 2018. These large datasets were generated using R on <b>MonARCH</b>, one of Monash University’s high-performance computing facilities. The datasets became the basis for the linguistic analyses in the following publication:

Rajeg, Gede Primahadi Wijaya, Poppy Siahaan & Alice Gaby. 2022. The Spatial Construal of TIME in Indonesian: Evidence from Language and Gesture. <i>Linguistik Indonesia</i> 40(1). 1–24. https://doi.org/10.26499/li.v40i1.297.

This repository also includes the R scripts used to create the n-grams. The key R package used to produce the n-grams (including the corpus tokenisation) is quanteda (Benoit et al. 2018), supported by the suite of R packages from the tidyverse (Wickham et al. 2019), tidytext (Silge & Robinson 2017), and corplingr (Rajeg 2021). Line 60 onwards in the file <code>R-script-ngram-creation-2-4-grams.R</code> shows how to search/filter and tabulate the n-gram frequency for a given time noun (i.e., <i>tahun</i> ‘year’ in the example).

<b>References</b>
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach (First edition). O’Reilly.
Benoit et al. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774 https://quanteda.io
Wickham et al. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Rajeg, G. P. W. (2021). corplingr: Tidy concordances, collocates, and wordlist. Open Science Framework (OSF). https://doi.org/10.17605/OSF.IO/X8CW4 https://github.com/gederajeg/corplingr/
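
The workflow described in the overview (tokenise the sentences, generate 2- to 5-grams, then filter and tabulate the n-grams containing a given time noun) can be illustrated with a minimal sketch. This is not the author's code: the repository's scripts use R and quanteda, whereas the sketch below uses only the Python standard library. The function names, the regex-based tokeniser, and the three toy sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenise(sentence):
    """Lowercase and extract word tokens (a simplification, not quanteda's tokeniser)."""
    return re.findall(r"\w+", sentence.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(sentences, sizes=range(2, 6)):
    """Count every n-gram (2-grams up to 5-grams by default) across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenise(sentence)
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts


def filter_by_word(counts, word):
    """Keep only the n-grams that contain `word` as a whole token."""
    return Counter({gram: c for gram, c in counts.items() if word in gram.split()})


# Toy sentences invented for illustration; they are NOT taken from the ILCC corpus file.
sentences = [
    "tahun depan kami pindah",
    "tahun depan mereka datang",
    "dua tahun lalu",
]

freqs = ngram_frequencies(sentences)
tahun = filter_by_word(freqs, "tahun")

# Print the filtered n-grams, most frequent first.
for gram, count in tahun.most_common():
    print(count, gram)
```

Matching on whole tokens rather than substrings keeps the filter from picking up longer words that merely contain the search term. For a corpus of one million sentences, the counting would more realistically be done per n-gram size and written to disk, which is presumably why the original datasets were produced on a high-performance computing facility.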
Provider:
figshare
Created:
2024-09-30



