Serbian Adjectival Degree Semantics Corpus (SADSC): Long and Short Form Adjectives
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14673043
Resource description:
The Serbian Adjectival Degree Semantics Corpus (SADSC) is a comprehensive dataset organized around adjectival lemmas in the Serbian language, with a particular focus on the distinction between long and short form adjectives. It provides detailed frequency information and combinations of forms with various degree modifiers, aiming to support research in syntax, semantics, and corpus linguistics.
Purpose of Data Collection
The data were collected to test a novel hypothesis regarding the distinction between long and short form adjectives in Serbian. This hypothesis posits that the distinction revolves around degree semantics, rather than the traditional view that it signals nominal definiteness or specificity. Specifically:
Long form adjectives are hypothesized to be specified for definite degree.
Short form adjectives are hypothesized to be specified for indefinite degree.
This contrasts with the received view, which ties the forms to the definiteness/specificity distinction of the nominal phrase.
Predictions of the Degree-Based Hypothesis
Unlike the nominal definiteness/specificity account, the degree-based analysis predicts:
Correlation between Short Form and Comparative Forms: The availability of short (indefinite) forms should correlate with the presence of comparative forms, since both are ultimately tied to degrees. Non-gradable adjectives are expected to show few or no comparative forms and high frequencies of long forms, for one of two reasons. Either they do not project scales, which are a precondition for the degree (in)definiteness distinction to show up (hence only the default long form is used), or they project binary (0,1) scales on which the value 1 (the presence of the quality) is inherently definite (unique).
Stronger Compatibility with Degree Modifiers:
Short forms are predicted to align better with indefinite degree modifiers (e.g., veoma "very", koliko "how/how much").
Long forms may combine with definite degree modifiers, where the definiteness is signaled by the modifier itself (e.g., toliko "that much").
Alternatively, short forms may show a greater propensity to combine with all degree modifiers. When a short form adjective combines with a definite degree modifier, the definiteness (uniqueness) component may be contributed compositionally by the modifier, with the short form contributing only the existential component.
Predictions of the Nominal Definiteness/Specificity Hypothesis
If the long vs. short form adjective distinction signals definiteness/specificity of the noun phrase, there is no reason to expect:
any link between the availability/frequency of either long or short forms and comparatives
any difference in the combinability of either long or short forms with degree modifiers of any kind (definite or indefinite).
Finding
The data show a strong correlation between the availability/frequency of short forms and comparative forms, as well as significantly higher frequencies of short forms with all degree modifiers. Both results support the Degree-Based Hypothesis.
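The kind of correlation reported here can be checked per lemma from the LFfreq, SFfreq and COMPfreq columns. The following is a minimal sketch, not the author's analysis: the source does not say which statistic was used, so Spearman's rank correlation is an assumption, as is the choice to express short form availability as SFfreq / (LFfreq + SFfreq). The three records are invented numbers for illustration only, and all function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented example records (NOT taken from the dataset).
records = [
    {"LFfreq": 90, "SFfreq": 10, "COMPfreq": 0},
    {"LFfreq": 50, "SFfreq": 50, "COMPfreq": 20},
    {"LFfreq": 20, "SFfreq": 80, "COMPfreq": 45},
]

# Share of short forms among all positive-degree forms of each lemma.
sf_share = [r["SFfreq"] / (r["LFfreq"] + r["SFfreq"]) for r in records]
comp = [r["COMPfreq"] for r in records]

rho = spearman(sf_share, comp)
print(f"Spearman rho between short form share and COMPfreq: {rho:.3f}")
```

A rank-based statistic is used in this sketch because corpus frequencies are typically heavily skewed; any other correlation measure could be substituted.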
Metadata and CQL Queries
Row 1: Contains the CQL (Corpus Query Language) frames used to generate the specific queries. The specific query for each cell in the table was used as input to a Python script that ran the individual corpus query designated by the CQL expression in that cell (see Data Collection). Each expression was subsequently replaced by the actual frequency count derived from the corpus.
Row 2: Titles of the columns, serving as headers for the dataset.
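Because the first two rows are metadata rather than data, they need to be handled separately when the table is read. The sketch below assumes the table has been exported to CSV (the distribution format is not stated here) and that every column other than lemma holds an integer count. The function name load_sadsc, the placeholder frame strings and the sample row are all hypothetical.

```python
import csv
import io


def load_sadsc(handle):
    """Read a CSV export of the table.

    Returns (cql_frames, records): a dict mapping each column title to the
    CQL frame stored in Row 1, and a list of one dict per adjectival lemma.
    """
    reader = csv.reader(handle)
    cql_row = next(reader)   # Row 1: CQL frames
    headers = next(reader)   # Row 2: column titles
    cql_frames = dict(zip(headers, cql_row))

    records = []
    for row in reader:
        record = {}
        for name, value in zip(headers, row):
            # The lemma stays a string; all other columns are assumed counts.
            record[name] = value if name == "lemma" else int(value)
        records.append(record)
    return cql_frames, records


# Invented stand-in for the real file: placeholder frames, one made-up row.
sample = io.StringIO(
    "CQL_FRAME_LEMMA,CQL_FRAME_LF,CQL_FRAME_SF\n"
    "lemma,LFfreq,SFfreq\n"
    "lemma_a,120,80\n"
)

frames, rows = load_sadsc(sample)
print(frames["SFfreq"], rows[0])
```

With a real export, the io.StringIO object would be replaced by an open file handle, for example open(path, newline="", encoding="utf-8").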
Data Organization
The corpus is structured into rows, where each row represents a unique adjectival lemma. The dataset includes the 1,100 most frequent adjectives, ranked by absolute frequency in the first 10,000,000 lines of the Serbian Web Corpus (srWaC; Ljubešić and Klubička 2016). The columns contain specific attributes and frequency counts for different forms and combinations of the adjective. Below is a detailed description of each column:
lemma (Column A): The base form of the adjective, sorted by overall frequency in the corpus.
frequency: The total frequency of the lemma in the first 10,000,000 lines of the corpus.
LFfreq: Absolute frequency of the long form of the adjective.
SFfreq: Absolute frequency of the short form of the adjective.
COMPfreq: Frequency of the comparative form of the adjective.
deg[very]_LFfreq: Frequency of the long form of the adjective combined with the degree modifier veoma ("very").
deg[very]_SFfreq: Frequency of the short form combined with veoma.
deg[completely]_LFfreq: Frequency of the long form combined with the degree modifier potpuno ("completely").
deg[completely]_SFfreq: Frequency of the short form combined with potpuno.
wh_SFfreq: Frequency of the short form with the wh degree modifier koliko ("how/how much").
wh_LFfreq: Frequency of the long form with koliko.
tprox_SFfreq: Frequency of the short form with the proximal demonstrative degree modifier toliko ("that much").
tprox_LFfreq: Frequency of the long form with toliko.
oprox_SFfreq: Frequency of the short form with the proximal demonstrative degree modifier ovoliko ("this much").
oprox_LFfreq: Frequency of the long form with ovoliko.
dist_SFfreq: Frequency of the short form with the distal demonstrative degree modifier onoliko ("that much").
dist_LFfreq: Frequency of the long form with onoliko.
indef_SFfreq: Frequency of the short form with the indefinite degree modifier nekako ("somehow").
indef_LFfreq: Frequency of the long form with nekako.
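The paired *_SFfreq and *_LFfreq columns listed above lend themselves to a per-modifier comparison of short and long forms. The following is a minimal sketch of one way to do that, computing the share of short forms among all occurrences of an adjective with a given modifier. The column prefixes come from the list above; the function name short_form_share and the example counts are invented for illustration.

```python
# Prefixes of the modifier columns described above.
MODIFIER_PREFIXES = [
    "deg[very]", "deg[completely]", "wh", "tprox", "oprox", "dist", "indef",
]


def short_form_share(record, prefix):
    """Share of short forms among all hits of one adjective with one modifier.

    Returns None when the adjective never occurs with that modifier, so that
    empty cells are not mistaken for a 0% share.
    """
    sf = record[f"{prefix}_SFfreq"]
    lf = record[f"{prefix}_LFfreq"]
    total = sf + lf
    return sf / total if total else None


# Invented counts for a single lemma (NOT taken from the dataset).
record = {
    "lemma": "lemma_a",
    "deg[very]_SFfreq": 30, "deg[very]_LFfreq": 10,
    "tprox_SFfreq": 0, "tprox_LFfreq": 0,
}

# Only the modifiers present in this toy record are evaluated.
shares = {
    prefix: short_form_share(record, prefix)
    for prefix in MODIFIER_PREFIXES
    if f"{prefix}_SFfreq" in record
}
print(shares)
```

Returning None for modifiers with no hits keeps sparse rows out of any later averaging, which matters because many lemma and modifier combinations are likely to be rare.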
Data Collection
The data were obtained automatically using a specifically designed Python script, Corpus Querier, which ensures fair use compliance. The script is available on GitHub [Kovačević 2025]. The dataset is derived from the Serbian Web Corpus (srWaC), version 1.1, hosted at CLARIN.SI.
References
Kovačević, P. (2025). Corpus Querier: A Python script for automated corpus querying with fair use compliance (Version 1.0) [Computer software]. GitHub. https://github.com/pedja-kovacevic/Corpus-Querier
Ljubešić, N., & Klubička, F. (2016). Serbian web corpus srWaC 1.1. Slovenian language resource repository CLARIN.SI. ISSN 2820-4042. http://hdl.handle.net/11356/1063
Created:
2025-02-13



