Swahili Verb Conjugation Dataset: A Comprehensive Analysis of Agglutination and Verb Structure Across Tenses and Persons
Source: DataCite Commons. Updated: 2025-04-15. Indexed: 2025-04-16.
Download link:
https://data.mendeley.com/datasets/rvt89578g5/1
Description:
This Swahili Verb Conjugation Dataset contains over 319,156 conjugated verb forms, compiled to capture the complexity of Swahili’s agglutinative verb morphology. Swahili has a rich inflectional system in which verbs are modified by adding prefixes and suffixes that encode grammatical information such as tense, aspect, person, number, and mood.
The dataset consists of a single CSV file, where each row represents a unique verb root (mzizi_wa_neno) and its conjugated forms across multiple dimensions:
Tenses: The dataset captures five core tenses: past, perfect, present, future, and simple present. These tenses play a critical role in Swahili verb conjugation and vary significantly from English tense structures, making this dataset an essential resource for handling these tense markers in NLP tasks.
Persons and Numbers: Conjugations are provided for the 1st, 2nd, and 3rd persons, both singular (umoja) and plural (wingi). Each of these persons is conjugated across the five tenses, providing a comprehensive overview of the morphological changes that occur depending on the subject.
Moods: The dataset includes the habitual mood (hali_ya_mazoea), as well as other modal forms and auxiliary verbs that are part of Swahili’s verb system, such as kum (ability), kuwa (to be), and various hypothetical forms (e.g., ninge, unge, ange for conditional tense).
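The agglutinative pattern described above (subject prefix, then tense marker, then verb stem) can be illustrated with a minimal sketch. This is not how the dataset was generated. It assumes a regular stem that already carries the final vowel (for example `soma`, "read"), and it ignores monosyllabic verbs, vowel-initial stems, negation, object markers, and the contracted simple present. The function name `conjugate` and the English tense labels are hypothetical. The person labels `umoja` and `wingi` follow the dataset's own terms.

```python
# Illustrative sketch only: subject prefix + tense marker + stem.
# Covers regular stems; irregular and monosyllabic verbs are out of scope.

# Standard Swahili subject prefixes for the personal (human) forms.
SUBJECT_PREFIXES = {
    ("1", "umoja"): "ni",   # I
    ("2", "umoja"): "u",    # you (singular)
    ("3", "umoja"): "a",    # he / she
    ("1", "wingi"): "tu",   # we
    ("2", "wingi"): "m",    # you (plural)
    ("3", "wingi"): "wa",   # they
}

# Tense and mood markers; the English keys are labels chosen for this sketch.
TENSE_MARKERS = {
    "past": "li",
    "perfect": "me",
    "present": "na",
    "future": "ta",
    "conditional": "nge",   # yields the ninge-, unge-, ange- forms
}


def conjugate(stem: str, person: str, number: str, tense: str) -> str:
    """Concatenate subject prefix, tense marker, and stem."""
    return SUBJECT_PREFIXES[(person, number)] + TENSE_MARKERS[tense] + stem


if __name__ == "__main__":
    for tense in TENSE_MARKERS:
        print(tense, conjugate("soma", "1", "umoja", tense))
```

Because each slot contributes an independent morpheme, six subject prefixes and five markers already yield thirty forms per stem, which is why a table of conjugations grows so quickly.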
The columns in the dataset include:
Verb Root (mzizi_wa_neno): The base form of the verb from which all conjugated forms are derived.
Conjugated Forms: These columns represent the verb conjugations for the 1st, 2nd, and 3rd persons, both singular and plural, across all tenses. For example, nafsi_ya_kwanza_umoja_wakati_uliopita refers to the 1st person singular in the past tense.
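To work with the wide layout described above, one option is to split each column name into its person, number, and tense parts and reshape the rows into one record per conjugated form. The sketch below assumes every conjugation column follows the same pattern as the one documented name, `nafsi_ya_kwanza_umoja_wakati_uliopita`; the real file may name other columns differently. The sample row is invented for illustration, and `parse_column` and `to_long` are hypothetical helper names.

```python
import csv
import io

# Invented one-row sample using only the two column names the description confirms.
SAMPLE = (
    "mzizi_wa_neno,nafsi_ya_kwanza_umoja_wakati_uliopita\n"
    "soma,nilisoma\n"
)

PERSON_PREFIX = "nafsi_ya_"   # "person of ..."
TENSE_SEPARATOR = "_wakati_"  # "... tense ..."


def parse_column(name):
    """Split a column name into person, number, and tense, or return None.

    Assumes the pattern nafsi_ya_<person>_<number>_wakati_<tense>.
    """
    subject, sep, tense = name.partition(TENSE_SEPARATOR)
    if not sep or not subject.startswith(PERSON_PREFIX):
        return None
    person, _, number = subject[len(PERSON_PREFIX):].rpartition("_")
    return {"person": person, "number": number, "tense": tense}


def to_long(rows):
    """Yield one record per non-empty conjugated form."""
    for row in rows:
        root = row["mzizi_wa_neno"]
        for column, form in row.items():
            features = parse_column(column)
            if features and form:
                yield {"root": root, "form": form, **features}


records = list(to_long(csv.DictReader(io.StringIO(SAMPLE))))

if __name__ == "__main__":
    for record in records:
        print(record)
```

Columns that do not match the assumed pattern, such as the verb root itself or mood columns like `hali_ya_mazoea`, are skipped by this sketch and would need their own handling.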
This dataset not only provides standard conjugations for Swahili verbs but also covers various auxiliary and hypothetical forms. The extensive collection of forms makes this dataset an invaluable resource for researchers interested in Swahili Natural Language Processing (NLP), as it offers the morphological richness needed for tasks like tokenization, lemmatization, and syntactic parsing.
Additionally, this dataset is adaptable for linguistic research beyond computational applications. It can be used to study Swahili verb morphology, tense-aspect systems, and cross-linguistic comparisons with other agglutinative languages.
Provider:
Mendeley Data
Created:
2024-10-22



