
Romanian Weak Pronoun Choice Data

Source: doi.org; updated 2023-09-28; indexed 2025-01-16
Download link:
https://doi.org/10.18710/GSV27M
Description:
The following corpus study shows that soft linguistic constraints are hard to describe and operationalize. In specific contexts, some Romanian clitic pronouns allow a choice between phonological hosts, as in că-mi dai cartea vs. că îmi dai cartea, both meaning [that you give me the book]. What determines the choice between the subjunction că in că-mi and prosthetic î in îmi (cf. Lombard 1976)? Popescu (2003, p. 160) argues for speech rate as the trigger of surface realization (monosyllabic că-mi in fast speech vs. bisyllabic că îmi in normal speech), while Dindelegan (2013, p. 388) argues for register rules (informal că-mi vs. formal că îmi). On the register account, formal, written language represents one extreme of a formality scale and informal, spoken language the other. Thus, a Romanian corpus of official documents, such as legal texts, is expected to contain only, or significantly many, forms with prosthetic î in constellations that otherwise allow optional variants. To test these two hypotheses, the Romanian part of the JRC-Acquis corpus (http://ec.europa.eu/dgs/jrc/) has been tagged with the RACAI tagger (http://www.racai.ro). The resulting corpus of 62,650,821 tokens (including punctuation) has been evaluated with respect to the phenomena under scrutiny. Taking specific hosts into account, enclitic forms have been compared with their î-prosthetic counterparts. The numbers show almost no, or a statistically insignificant, difference in usage for some specific host+clitic pairs (e.g., 3886 să îşi vs. 3852 să-şi [that to himself/herself], 200 ce îi vs. 110 ce-i [what to him/her]). From a usage-based perspective, these findings are clear arguments both against the register rules purported by Dindelegan (2013) and against a pure speech rate hypothesis as in Popescu (2003).
Since the JRC-Acquis corpus is translated from English by different translators, perhaps both native and non-native speakers of Romanian, a further corpus of original Romanian legal texts is being compiled for further analysis and comparison. The full dataset consists of (1) two tgz files containing the POS-tagged data extracted from the JRC-Acquis corpus (enclitic forms and î-prosthetic forms); the data is in XML format, which is described in (2) the description file; and (3) the draft of the article as a PDF file for linguistic background.
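
To make the comparison in the description concrete, the short Python sketch below computes the share of î-prosthetic forms for the two host+clitic pairs quoted above. It is only an illustration: the counts are copied from the description, while the names `counts` and `prosthetic_share` are hypothetical and not part of the dataset. It performs no significance test, since the statistical procedure used in the study is not stated here.

```python
# Counts quoted in the dataset description (Romanian part of JRC-Acquis).
# The dictionary layout and function name are illustrative assumptions.
counts = {
    "să": {"prosthetic": 3886, "enclitic": 3852},  # să îşi vs. să-şi
    "ce": {"prosthetic": 200, "enclitic": 110},    # ce îi vs. ce-i
}


def prosthetic_share(prosthetic: int, enclitic: int) -> float:
    """Return the fraction of î-prosthetic forms among both variants."""
    total = prosthetic + enclitic
    if total == 0:
        raise ValueError("no occurrences for this host+clitic pair")
    return prosthetic / total


# Share of î-prosthetic forms per host.
shares = {host: prosthetic_share(**c) for host, c in counts.items()}

for host, share in shares.items():
    print(f"{host}: {share:.1%} î-prosthetic")
```

Under the register hypothesis, a corpus of legal texts would be expected to show shares close to 100% for such pairs, so the computed shares can be read directly against that expectation.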

Provided by:
DataverseNO