Replication data for Early Slavic When-clauses in cross-linguistic perspective

Figshare, updated 2026-02-06, indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/Replication_data_for_Early_Slavic_When-clauses_in_cross-linguistic_perspective/28107050


Description:

This repository contains the data associated with the article Early Slavic When-clauses in cross-linguistic perspective, currently under review.

The repository also includes the semantic maps for all languages in the parallel dataset (including those not shown in the paper), the Python scripts to reproduce the pipeline described in the article (from generating a Hamming distance matrix to applying Ordinary Kriging), and a README.md describing the repository contents, instructions for running the scripts, and the New Testament translation version used for each language.

The README.md file (COMING SOON!) contains much of the same content found below and is included mainly for portability, but it also gives more information about the New Testament versions used for each language in the dataset.

The following data can be found under /datasets:

df_when_no_ngrams.csv: the raw aligned data. Each row is one context in which the word "when" occurs in the English Standard Version of the New Testament, and each column is one language in the dataset, holding the automatically aligned parallel for that row. NOMATCH indicates that the alignment found no parallel word. Empty values indicate that the language has no translation for the respective Bible verse.

df_when_ngrams.csv: the parallel data after applying the n-gram search method described in the article. N-gram groups (i.e., possible allomorphs of a morpheme potentially meaning 'when') are labelled ngram_1, ngram_2, ngram_3, etc. To see what each group corresponds to, check ngram_details.txt. The post-corrected and functionally labelled data (used in the paper to generate the semantic maps) can also be found in this .csv, alongside the original data. Column names with -corrected appended to the language ISO code mean that the values were checked, either manually or semi-automatically.

ngram_details.txt: a breakdown of the n-gram groups automatically found for each language in the dataset and the character n-grams they actually refer to.

glottolog_withfams.csv: a table based on Glottolog's classification that maps ISO language codes to language name, language family, and the world region in which the language is primarily used. This table was used to automatically generate the titles of the semantic maps in the paper.

The following can be found under scripts/:

[CONTENT COMING SOON!]

The folder maps/ contains all the semantic maps generated from the data in df_when_ngrams.csv.
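Since the scripts are not yet published, the following is only a minimal sketch of how the first pipeline step (a Hamming distance matrix between contexts) could be computed from a file shaped like df_when_ngrams.csv. It is not the authors' implementation. The column name "context", the language columns, and all cell values in the toy data are invented for illustration. Skipping empty cells and treating NOMATCH as an ordinary value are assumptions as well, as is the normalization by the number of compared languages.

```python
import csv
import io

# Toy stand-in for datasets/df_when_ngrams.csv. The column names ("context",
# "lang_a", ...) and every cell value are hypothetical, for illustration only.
TOY_CSV = """context,lang_a,lang_b,lang_c
v1,ngram_1,ngram_2,ngram_1
v2,ngram_1,ngram_2,
v3,ngram_2,NOMATCH,ngram_1
"""


def load_rows(handle, id_column="context"):
    """Read an open CSV handle and return (ids, language_columns, rows)."""
    reader = csv.DictReader(handle)
    languages = [name for name in reader.fieldnames if name != id_column]
    rows = list(reader)
    ids = [row[id_column] for row in rows]
    return ids, languages, rows


def hamming(row_a, row_b, languages):
    """Normalized Hamming distance between two contexts.

    Only languages with a value in both rows are compared. Empty cells
    (no translation for the verse) are skipped, and NOMATCH counts as an
    ordinary value; both choices are assumptions of this sketch.
    """
    compared = 0
    differing = 0
    for lang in languages:
        a, b = row_a[lang], row_b[lang]
        if a == "" or b == "":
            continue  # verse not translated in this language
        compared += 1
        if a != b:
            differing += 1
    # No shared languages: the distance is undefined.
    return differing / compared if compared else float("nan")


def hamming_matrix(rows, languages):
    """Square matrix of pairwise distances between all contexts."""
    return [[hamming(a, b, languages) for b in rows] for a in rows]


ids, languages, rows = load_rows(io.StringIO(TOY_CSV))
matrix = hamming_matrix(rows, languages)
```

To try this on the real file, replace the io.StringIO(TOY_CSV) handle with an open file handle and adjust id_column to whatever the context column is actually called. Restricting the language columns to those ending in -corrected would limit the comparison to the checked values.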

Created:

2026-02-06
