Replication Data for: Predicting Stress in Russian using Modern Machine-Learning Tools
Source: DataONE. Updated 2025-03-04; indexed 2025-04-26.
Download link:
https://search.dataone.org/view/sha256:43ac7eb1bcb2384438c65826cc2042edeef2f199d0d100ec2de9780291a97a42
Description:
This dataset consists of a single TSV file with five columns of data originating in Zaliznyak's Grammar and Dictionary (1977). The data was programmatically scraped from Giella project data (Moshagen et al., 2013) by Spektor (2021), where it served as one of four sources for the RusLex application. After the data was taken from there, only symbols were removed. The Russian words are preserved in Cyrillic, as in the original.

The last column contains abbreviated morphological features in English (e.g. "V" for verb, "N" for noun, "Fem" for feminine, "Cmpr" for comparative, "Impf" for imperfective). A word often has many features, which are separated by semicolons.

A stress code representing stress placement was derived for each word by counting syllables from the end of the word. A code of 0 marks stress on the final syllable (oxytone), 1 marks stress on the penultimate syllable (paroxytone), and 2 marks stress on the antepenultimate syllable (proparoxytone). The script continued counting in this way until a stress code was assigned. The one exception is -1, which is assigned to words without explicit stress markers.

The columns in the resulting TSV are: the word without stress markers, the word with stress markers, the derived stress code, the lemma, and all morphological features. The dataset contains over 300,000 words from Zaliznyak (1977); many words are repeated, each occurrence with its own morphological features. Please see the paper for a full description of the dataset.

References:
Moshagen, Sjur N., Tommi Pirinen, and Trond Trosterud. (2013). Building an open-source development infrastructure for language technology projects. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) (pp. 343–352).
Spektor, Y. (2021). Detection and morphological analysis of novel Russian loanwords (Master’s thesis, CUNY Graduate Center, New York, NY). Retrieved from https://academicworks.cuny.edu/gc_etds/4572/
Zaliznyak, A.A. (1977). Grammatičeskij slovar’ russkogo jazyka. Slovoizmenenie [A grammatical dictionary of Russian: Inflection]. Moscow: Russkij jazyk.
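
The stress-code rule described above can be illustrated with a minimal Python sketch. This is not the authors' script. It assumes that stress is marked with a combining acute accent (U+0301) placed directly after the stressed vowel, that the first such mark is the one that counts, and that the five columns appear in the order listed in the description. The names `stress_code`, `parse_row`, and `Entry`, as well as the sample row and its feature tags, are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple

VOWELS = set("аеёиоуыэюя")
STRESS_MARK = "\u0301"  # assumption: combining acute accent after the stressed vowel


def stress_code(marked_word: str) -> int:
    """Count syllables from the end of the word to the stressed one.

    0 = final syllable, 1 = penultimate, 2 = antepenultimate, and so on.
    Returns -1 when the word carries no explicit stress mark.
    """
    vowel_count = 0
    stressed_index = None  # 1-based position of the stressed vowel
    for ch in marked_word.lower():
        if ch in VOWELS:
            vowel_count += 1
        elif ch == STRESS_MARK and vowel_count > 0 and stressed_index is None:
            # assumption: only the first stress mark is considered
            stressed_index = vowel_count
    if stressed_index is None:
        return -1
    return vowel_count - stressed_index


class Entry(NamedTuple):
    word: str            # word without stress markers
    stressed_word: str   # word with stress markers
    stress_code: int     # derived stress code
    lemma: str
    features: List[str]  # morphological features, split on semicolons


def parse_row(line: str) -> Entry:
    """Parse one tab-separated row (column order is an assumption)."""
    word, stressed, code, lemma, feats = line.rstrip("\n").split("\t")
    return Entry(word, stressed, int(code), lemma,
                 [f for f in feats.split(";") if f])


if __name__ == "__main__":
    # Hypothetical row; the feature tags are illustrative only.
    row = "собака\tсоба\u0301ка\t1\tсобака\tN;Fem"
    entry = parse_row(row)
    print(entry.features, stress_code(entry.stressed_word))
```

Because the description assigns -1 to any word without an explicit stress marker, the sketch does not treat "ё" as implicitly stressed. Whether the original script did so is not stated in the description.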
Created:
2025-03-05



