Annotated database of Slovenian adjectives

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/15174243

Resource description:

This database presents the morphological annotation of Slovenian adjectives. It includes the 6,000 most frequent adjectives in Slovenian, extracted from the Gigafida 2.0 corpus (deduplicated) using the CQL query [tag="P.*"] on a random sample of 10,000,000 lines in the NoSketch Engine in March 2024.

The list contains some homophonous items. Because the corpus is not annotated for meaning, homophonous adjectives were counted as a single item. For example, premočen can mean either ‘soaked’ (from the verb premočiti ‘to drench’) or ‘too strong’ (from močen ‘strong’). The annotator decided which meaning they perceived as more salient and annotated the item as that adjective.

Proper names were not annotated as morphologically complex. For example, the possessive Gregorinov ‘Gregorin’s’ (Gregorin is a last name) is only marked for the possessive suffix -ov, even though the last name itself is probably decomposable into Gregor+in.

Column-by-column overview

We start by listing the columns in the database and showing what property is annotated in each of them.

Column A: ID
Adjectives are numbered consecutively, and this column contains the unique number assigned to each adjective.

Column B: Adjective
This column lists the citation form (lemma) of each adjective.

Column C: Frequency
This column provides the frequency of each adjective's lemma.

Column D: Included
This column distinguishes items we consider actual adjectives in the relevant sense from all other items. Items marked with 1 are included in the annotation, while items marked with 0 are excluded. The reasons for exclusion are: the item is not an adjective, the item is misspelled, or the item is a proper name or part of a proper name.

Columns E–K: Suffix 1 to Suffix 7
These columns list the specific suffixes contained in each adjective. Suffix 1 is the one closest to the root, followed by Suffix 2, and so on.
The aim was to pursue maximal decomposition. For instance, the possessive adjectival pronoun svoj ‘own’ was therefore decomposed into s-v-oj, based on its relation to s-eb-e ‘oneself’ as well as to m-oj ‘my’ and t-v-oj ‘your’. See the Appendix for the specific annotation decisions.

Column L: Ending
If the adjective has a phonologically overt inflectional ending (e.g., slovensk-i ‘Slovenian’), this ending is listed in column L.

Column M: Prefixes
If the adjective has a prefix, the prefix is listed in this column. Prefixes in loanwords are annotated in this column if the version without the prefix (or with some other prefix) also exists in Slovenian. E.g., iracionalen ‘irrational’ is annotated as having the prefix i- because racionalen ‘rational’ also exists. On the other hand, dis- in diskonten ‘discount’ is not given in this column, because *konten does not exist in Slovenian.

If the adjective has several prefixes, these are listed in the column, separated by a plus sign. The rightmost prefix is the one closest to the root/base.

If the prefix is marked with an asterisk, it modifies an existing adjective. In other cases, the prefix is part of a non-adjectival base that was adjectivised. Compare:
predolg ‘too long’: prefix marked as pre* (dolg ‘long’ is an adjective)
preminul ‘dead’: prefix marked as pre (the adjective is derived from the verb preminiti ‘to die’)
zavezniški ‘ally’: prefix marked as za (the adjective is derived from the noun zaveznik ‘ally’)

Items that could be taken to be prefixes, but for which no unprefixed version of the base (or version with a different prefix) is attested, are given in brackets. For instance, zanikrn ‘sloppy’ has (za) in this column, since the annotator has the intuition that za is a prefix in this word, but *nikrn is not attested.

If it was unclear whether the item in question was a single prefix or could be further decomposed, a potential decomposition is provided.
One such example is izpodbijan ‘contentious’, where the prefixes iz- and pod- also exist (as do the prepositions iz, pod and izpod), which is why the prefixes were annotated as iz+pod.

Column N: Non-derived adjective
Adjectives that are taken to be non-derived (i.e., cases where we have no arguments for assuming they are morphologically complex) get a 1 in this column; all others are assigned a 0. For instance, bled ‘pale’ has a 1 in this column, whereas mandlj-ev ‘made out of almonds’ has a 0.

Column O: Zero
Adjectives that contain a base from a different category or a compound base, but do not include an overt adjectivising morpheme, are assigned a 1 in this column; all others are assigned a 0. An example is drag-o-cen-Ø, lit. expensive-o-price, ‘invaluable’.

Column P: Compound base
Adjectives that have a compound base get a 1 in this column; all others are assigned a 0. If an adjective is annotated with a 1, the right component of the compound is decomposed for suffixes only. For instance, drug-o-uvrščen ‘runner-up’ (literally second-o-classified) has the prefix u- in the right part, but this is not annotated separately.

Loan adjectives are marked as having a compound base if the components of the base are used in other contexts in Slovenian. E.g., the base of radiološki ‘radiological’ is radiolog, which contains radio, used as an independent word meaning ‘radio’, and -log, also attested in, e.g., psiholog ‘psychologist’ and arheolog ‘archeologist’.

Finally, if an item is marked as a compound, it is not also marked as participial, even if it contains a deverbal participle. A case in point is drug-o-uvrščen ‘runner-up’, which contains the passive participle of the verb uvrstiti ‘classify’.

Column R: PTCP
If the adjective is a passive or active participle, it is assigned a 1 in this column. If not, it is assigned a 0.
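The notation used in Column M (plus-separated prefixes, a trailing asterisk, brackets) can be unpacked mechanically. The following is a minimal sketch, assuming that cells are stored exactly as written in the description above (e.g. pre*, iz+pod, (za)); the names Prefix and parse_prefixes are hypothetical and are not part of the database.

```python
from dataclasses import dataclass


@dataclass
class Prefix:
    form: str                 # the prefix itself, e.g. "pre"
    modifies_adjective: bool  # True if the cell marks it with an asterisk
    tentative: bool           # True if the cell gives it in brackets


def parse_prefixes(cell: str) -> list[Prefix]:
    """Parse a Column M cell such as 'pre*', 'iz+pod' or '(za)'.

    Assumption: the cell format matches the description; the actual
    file may differ in spacing or encoding.
    """
    prefixes = []
    for part in cell.split("+"):
        part = part.strip()
        if not part:
            continue  # empty cell: the adjective has no prefix
        modifies = part.endswith("*")
        if modifies:
            part = part[:-1]
        tentative = part.startswith("(") and part.endswith(")")
        if tentative:
            part = part[1:-1]
        prefixes.append(Prefix(part, modifies, tentative))
    return prefixes
```

The returned list keeps the written order, so its last element is the prefix closest to the root/base, as in iz+pod.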
Appendix: Specific decisions for the annotation of suffixes in columns E–K

The general criterion for annotating an element as a suffix was its occurrence in multiple adjectives and/or in combination with other suffixes. Crucially, this means that we also attempted to decompose elements that are sometimes considered a single suffix. For example, -kast in siv-kast ‘gray-ish’ was annotated as siv+k+ast, since both -k and -ast are independently attested suffixes (kič-ast ‘kitsch-y’, ljub-(e)k ‘cute’).

In some cases, especially with borrowed words, it was impossible to reconstruct the underlying representation of suffixes that only appear before palatalising suffixes. For instance, in sarkastičen ‘sarcastic’, the sequence -ič- can, in principle, be underlyingly -ik-, -ic-, or -ič-, as all these underlying representations could lead to the surface allomorph -ič-. In such cases, we opted for analogy with comparable words whose intermediate bases do surface as independent words. Here, an analogy can be made with words like logističen ‘logistic’, with the base logistika ‘logistics’. As a consequence, sarkastičen was annotated as sarkast+ik+n.

Some nominal bases display so-called stem extensions, which occur throughout the paradigm of the noun (e.g., vrem-e ‘weather’ has the genitive singular vrem-en-a, dative singular vrem-en-u, etc.). Stem extensions like en were not annotated as derivational suffixes, so that, e.g., vrem-en-sk-i is annotated as having only the suffix sk.

Similarly, many nouns ending in -r in the nominative singular get an extra -j in other forms in the paradigm. Because -j is present in the declension of the noun, it was not annotated as a suffix. E.g., krompir ‘potato’ has the genitive singular krompir-j-a. The related adjective krompir-j-ev ‘related to potato’ is annotated as krompir+ov (see below for -ov vs. -ev).
The same goes for loanwords ending in a vowel, e.g., kupe ‘compartment’ with the genitive singular kupeja, where the adjective kupejevski ‘related to a compartment’ is annotated as kupe+ov+sk+i.

Phonologically conditioned allomorphs were generally annotated as a single morpheme:

The morpheme -ov- systematically surfaces as -ev- after a set of consonants traditionally termed soft (j, c, č, ž, š). This morpheme was annotated as -ov- regardless of its surface form. E.g., zmajev ‘dragon’s’ was annotated as zmaj+ov. This allows us to make a distinction between the morpheme -ov- and the morpheme -ev-, which can appear after all consonants (e.g., in pre+hit+ev+a+l+(e)n ‘overtaking’).

Consonant-initial suffixes can trigger the insertion of an epenthetic vowel in some forms. In such cases, the suffix was annotated in the version without the epenthetic vowel. For instance, kosten ‘bony’ was annotated as kost+n, because the e does not surface in the genitive kost+n+ega, the feminine nominative singular kost+n+a, the neuter nominative singular kost+n+o, etc. This allows us to make a distinction between -n and -en, which always surfaces with the vowel. An example is zastražen ‘guarded’ (masculine genitive singular zastraž+en+ega, feminine nominative singular zastraž+en+a, neuter nominative singular zastraž+en+o, etc.). The same holds for some other consonant-initial suffixes, as in ljub-ek ‘cute’ (cf. ljub-k-ega, ljub-k-a, ljub-k-o), which was consistently annotated as -k.

In cases where glide formation is fully predictable, j is not annotated as a separate morpheme. For example, the verbal root bi ‘hit’ surfaces as bij whenever it is followed by a vowel. This is also the case in the secondary imperfective iz-pod-bij-a-ti ‘to challenge’. Since glide formation is predictable, the passive participle of this verb, iz-pod-bij-a-n, was annotated as only having the suffixes a+n.
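The -ov-/-ev- alternation described above is regular enough to be stated as a small rule. The sketch below is an illustration only, assuming that the soft consonants are exactly the five listed (j, c, č, ž, š) and that the caller passes the stem as it appears before the suffix, including any stem extension (e.g. krompirj rather than krompir); the name surface_ov is hypothetical.

```python
# The consonants the description calls "soft".
SOFT_CONSONANTS = ("j", "c", "č", "ž", "š")


def surface_ov(stem: str) -> str:
    """Attach underlying -ov to a stem, spelling it -ev after a soft consonant."""
    suffix = "ev" if stem.endswith(SOFT_CONSONANTS) else "ov"
    return stem + suffix
```

Under these assumptions, surface_ov("zmaj") gives zmajev and surface_ov("Gregorin") gives Gregorinov, while the annotation records -ov- in both cases. The rule covers only the phonologically conditioned allomorph, not the separate morpheme -ev- that can follow any consonant.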
When annotating passive participles, we assume that theme vowels are present whenever they can be reconstructed. The clearest case is passive participles in -a-n, where the theme vowel is preserved, for example kuh+a+n ‘cooked’ from kuh-a-ti ‘to cook’, where the theme vowel a is annotated separately. We also annotated the theme vowel in cases where it survives in the form of a consonant (e.g., s-pre-men-j-en ‘changed’ from s-pre-men-i-ti ‘to change’, annotated as s+pre+men+i+en), multiple consonants (e.g., z-gub-lj-en ‘lost’ from z-gub-i-ti ‘to lose’, annotated as z+gub+i+en) or through the palatalisation of the preceding consonant (e.g., na-va-j-en ‘habituated’ from na-vad-i-ti ‘to habituate’, annotated as having the suffixes i+en, since dj systematically palatalises to j).

In the verbal domain, the suffix ov systematically varies between two allomorphs: ov (e.g., in the infinitive pot-ov-a-ti ‘to travel’) and u (e.g., in pot-u-je-mo ‘we travel’). In derivation, the former allomorph is generally used, e.g., in pot-ov-a-l-en ‘related to travel’. However, in the so-called active adjectival participles, the latter allomorph surfaces, e.g., in pot-u-j-oč ‘travelling’. In both cases, we annotated the relevant morpheme as ov, so that pot-u-j-oč was annotated as pot+ov+j+oč.

If a derivational affix generally does not trigger the palatalisation of the preceding consonant, then in all cases where palatalisation does occur, the word is assumed to contain an additional palatalising morpheme -j-. For instance, the suffix -en generally does not trigger palatalisation (e.g., polst-en ‘made of felt’, derived from polst ‘felt’). Therefore, košč-en ‘bony’, derived from kost ‘bone’, is assumed to have an extra palatalising morpheme and was annotated as kost+j+en.

Created:

2025-04-08
