izaitova/slavic_fixed_expressions

Name: izaitova/slavic_fixed_expressions
Creator: izaitova
Published: 2023-10-09 11:55:54
License: 暂无描述

Hugging Face2023-10-09 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/izaitova/slavic_fixed_expressions

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-4.0 language: - ru - be - bg - cs - pl - uk tags: - fixed expressions - idioms - slavic size_categories: - n<1K viewer: false --- This dataset contains a cross-lingual inventory of non-compositional microsyntactic units in six Slavic languages, namely, Belarusian, Bulgarian, Czech, Polish, Russian, and Ukrainian. The original expressions in Russian were retrieved from the Russian National Corpus (RNC) (https://ruscorpora.ru/) that provides a dictionary of syntactic idioms includes prepositions, adverbial and predicatives, parenthetical expressions, conjunctions, and particles. For each Russian expression, the database also provides its frequency score in the RNC sub-corpora. We sorted the available expressions by their frequency scores and selected the 50 most frequent microsyntactic units from each category except for particles, for which only 27 distinct expressions are available. This got us a total of 227 microsyntactic units in Russian. Next, we used the search function of the RNC to extract translational correlates together with two parallel bilingual context sentences from the parallel sub-corpora. We acknowledge that direct correlates of microsyntactic units in different languages are not always available. Thus, we used partial correspondence whenever required, which we believe still allows us to compare microsyntactic units at scale. In a similar way, we used the search function of Czech National Corpus (https://korpus.cz/). We obtained six parallel sets of 227 microsyntactic units with parallel bilingual context sentences for each unit in all of the six Slavic languages. Each of the translated expressions and sentences was proofread and, when required, corrected by professional linguists that are also native speakers of the target language. Our multilingual database of syntactic idioms (microsyntactic phenomena) enables us to compare these phenomena across different languages and can be later used for further research on syntactic idiomaticity. To the best of our knowledge, this is the first database of its kind. Authors: * Iuliia Zaitova (Saarland University) * Irina Stenger (Saarland University) * Tania Avgustinova (Saarland University)

提供机构：

izaitova

原始信息汇总

数据集概述

数据集内容

包含六种斯拉夫语言（白俄罗斯语、保加利亚语、捷克语、波兰语、俄语、乌克兰语）的非组合性微观句法单位的跨语言清单。
数据集共收录227个微观句法单位，每个单位附带平行双语上下文句子。

数据集特点

通过俄罗斯国家语料库（RNC）和捷克国家语料库（CNC）的搜索功能，提取翻译对应项及平行双语上下文句子。
所有翻译表达和句子均由专业语言学家（同时也是目标语言的母语者）校对和必要时的修正。

数据集用途

用于跨语言比较微观句法现象，支持进一步的句法习语性研究。

数据集作者

Iuliia Zaitova (Saarland University)
Irina Stenger (Saarland University)
Tania Avgustinova (Saarland University)

许可协议

CC-BY-4.0

5,000+

优质数据集

54 个

任务类型

进入经典数据集