Replication data for: A quantitative and typological study of Early Slavic participle clauses and their competition (University of Oxford, DPhil Thesis)

Name: Replication data for: A quantitative and typological study of Early Slavic participle clauses and their competition (University of Oxford, DPhil Thesis)
Creator: Pedrazzini, Nilo
Published: 2023-10-23 00:00:00
License: No description available

Figshare, updated 2023-10-23, indexed 2026-04-08

Download link:

https://figshare.com/articles/dataset/Replication_data_for_A_quantitative_and_typological_study_of_Early_Slavic_participle_clauses_and_their_competition_University_of_Oxford_DPhil_Thesis_/24166254/1

Resource description:

This repository contains data and scripts to reproduce the analysis in Pedrazzini, Nilo. 2023. A quantitative and typological study of Early Slavic participle clauses and their competition. University of Oxford PhD Thesis.

In particular, the repository contains the following:

- The (Python 3) extraction scripts to obtain CSV datasets of conjunct participles, dative absolutes and jegda-clauses from the TOROT and PROIEL treebanks, as well as from 'strategically-annotated' treebanks. (NB: It does not contain the treebanks themselves; those should be downloaded from the respective official releases.)
- The outputs of the above scripts as of January 2023 (marianus_[CONSTRUCTION]_curated.csv for the Codex Marianus; shallow_torot_[CONSTRUCTION].csv for the rest of TOROT; [CONSTRUCTION]_strategic.csv for strategically annotated treebanks).
- The dataset of adverbial clauses from the Gospel of Luke in the Greek New Testament (PROIEL), containing annotation for rhetorical relations.
- Datasets containing all occurrences of conjunct participles, genitive absolutes, and hote/hotan clauses in the Greek New Testament in PROIEL.
- A dataset containing all post-matrix dative absolutes from the TOROT treebank, with added manual annotation on whether they can be interpreted as ELABORATIONS or FRAMES (das_as_elab.csv).
- All the datasets used for the case study in Chapter 2 of the thesis on strategically annotated treebanks.
- A generic preprocessing script (torot_preprocessing.py).
- A lemmatization script (assign_lemma.py) with an accompanying form:lemma dictionary (lemma_df.csv).
- Scripts to calculate relative saliency (i.e. number of mentions in the previous discourse) (calculate_saliency.py), topicworthiness (assign_topic_score.py), and distance of immediately preceding antecedents of nominal referents (calculate_distance.py), as described in the thesis. These are usable only for the information-structurally annotated texts in PROIEL, namely only the Codex Marianus as of October 2023.
- A script to establish whether jegda-clauses have co-referential subjects with their matrix clause (egda_establish_if_samesubj.py). It applies only to the Codex Marianus, since it relies on anaphoric link annotation.
- Scripts to add syntactic variables to the occurrences, including clause size (in number of tokens), number of intervening verbs (between construction and matrix clause), intervening verb relations, and whether it is the first clause in a sentence (extract_syntactic_predictors.py).
- The outputs of the above script, namely the original CSV extracted from TOROT, but with added syntactic predictors (same name, with with_syntax_predictor.csv appended).
- The dataset of English when and its parallels in the massively parallel corpus used in Chapters 5 and 6, with added columns containing the functional annotation of Pular occurrences, as well as the automatically added switch-reference markers for other languages, as described in Chapter 6.

The remaining scripts and data used in Chapters 5 and 6 can be found in the repository associated with Haug, Dag & Nilo Pedrazzini. Forthcoming. The semantic map of when and its typological parallels. Frontiers in Communication: https://doi.org/10.6084/m9.figshare.22072169.v1
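
As a rough illustration of two of the measures named in the description, relative saliency (number of mentions in the previous discourse) and distance to the immediately preceding antecedent, the sketch below computes both over a toy sequence of mentions. It is not the code in calculate_saliency.py or calculate_distance.py, which work on PROIEL's anaphoric annotation and may define the measures differently. The function names, the (token position, referent ID) input format, and the choice of tokens as the distance unit are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical input format: one (token_position, referent_id) pair per
# nominal mention, in text order. Real PROIEL data encodes this differently.

def relative_saliency(mentions):
    """For each mention, count earlier mentions of the same referent."""
    seen = Counter()
    scores = []
    for _position, referent_id in mentions:
        scores.append(seen[referent_id])
        seen[referent_id] += 1
    return scores

def antecedent_distance(mentions):
    """For each mention, distance in tokens to the previous mention of the
    same referent, or None when the referent has no antecedent yet."""
    last_seen = {}
    distances = []
    for position, referent_id in mentions:
        previous = last_seen.get(referent_id)
        distances.append(None if previous is None else position - previous)
        last_seen[referent_id] = position
    return distances

# Toy discourse with invented positions and referent IDs.
mentions = [(1, "A"), (4, "B"), (7, "A"), (9, "C"), (12, "A"), (15, "B")]
print(relative_saliency(mentions))    # [0, 0, 1, 0, 2, 1]
print(antecedent_distance(mentions))  # [None, None, 6, None, 5, 11]
```

Both functions make a single pass over the mentions and keep only per-referent running state, so they scale linearly with the number of annotated mentions.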

Provided by:

Pedrazzini, Nilo

Created:

2023-10-23