Replication data for: Who needs particles? A challenge to the classification of particles as a part of speech in Russian
Source: doi.org · Updated: 2023-09-28 · Indexed: 2025-01-08
Download link:
https://doi.org/10.18710/700FNV
Description:
In 1985, Zwicky argued that “particle” is a pretheoretical notion that should be eliminated from linguistic analysis. We propose a reclassification of Russian particles that implements Zwicky’s directive. Russian particles lack a coherent conceptual basis as a category and many are ambiguous with respect to part of speech. Our corpus analysis of Russian particles addresses theoretical questions about the cognitive status of parts of speech and practical concerns about how particles should be represented in computational models. We focus on nine high-frequency words commonly classed as particles: ešče, tak, ved’, slovno, daže, že, li, da, net. We show that current tagging of particles in the manually disambiguated Morphological Standard of the Russian National Corpus (RNC) is not entirely consistent, and that this can create challenges for training a part-of-speech tagger. We offer an alternative tagging scheme that eliminates the category of “particle” altogether. We show that our enriched scheme makes it possible for a part-of-speech tagger to achieve more useful results. Our analysis of particles provides a detailed account of various sub-uses that correspond to different parts of speech, their relationships, and relative distribution. In this sense, our study also contributes to the study of words that exhibit part-of-speech ambiguities. We construct a database by extracting from the RNC gold standard 100 random sentences for each of the nine focus words. This database is used for both training and testing a Hidden Markov Model (HMM) trigram tagger (Halácsy et al. 2007), which is the standard model for training part-of-speech tagging. This is done in two rounds: in Experiment 1 we use the tagging of the nine words as in the RNC, including the use of “particle” as a tag; in Experiment 2 we use our own tagging scheme which eliminates “particle” as a tag.
In both experiments we partition our database into ten chunks and perform a ten-fold cross-validation, each time using 90 sentences as the training set and 10 sentences as the test set. This means that each part of the total set is tested in the course of the ten repetitions of training and testing.
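The ten-fold procedure described above can be illustrated with a minimal sketch. It is not the authors' script: the function name `ten_fold_splits`, the fixed random seed, and the placeholder sentence identifiers are all assumptions made for illustration, and the sketch assumes the 90/10 split is applied to the 100 sentences of a single focus word.

```python
import random


def ten_fold_splits(sentences, n_folds=10, seed=0):
    """Yield (train, test) pairs so that every sentence is tested exactly once.

    Hypothetical helper, not taken from the dataset's own code.
    """
    shuffled = list(sentences)
    # A fixed seed (an assumption) keeps the partition reproducible.
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled sentences into n_folds equally sized chunks.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Training set: every chunk except the one held out for testing.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Placeholder identifiers stand in for the 100 sentences of one focus word.
sentences = [f"sentence_{i}" for i in range(100)]
splits = list(ten_fold_splits(sentences))

for train, test in splits:
    # In the study, a tagger would be trained on `train` and scored on `test`.
    assert len(train) == 90 and len(test) == 10
```

Because the chunks are disjoint and together cover the whole set, each sentence appears in exactly one test set across the ten repetitions, which is the property the description relies on.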
Provider:
DataverseNO



