
Swahili Verb Conjugation Dataset: A Comprehensive Analysis of Agglutination and Verb Structure Across Tenses and Persons

Source: DataCite Commons. Updated 2025-05-01; indexed 2025-05-17.
Download link:
https://data.mendeley.com/datasets/rvt89578g5
Description:
The Swahili Verb Conjugation Dataset is an extensive resource containing over 319,156 compiled verb forms, designed to capture the agglutinative morphology of Swahili. This Bantu language, widely spoken across East Africa, has a highly developed inflectional system in which verbs are modified through prefixes and suffixes to encode grammatical categories such as tense, aspect, mood, person, and number.

Dataset Overview

The dataset is provided as a single CSV file. Each row represents a unique verb root (mzizi_wa_neno) together with its conjugated forms across the following linguistic dimensions.

Tenses

The dataset covers five fundamental tenses (past, perfect, present, future, and simple present), each essential for understanding the temporal structure of Swahili. These tenses differ significantly from their English counterparts, which makes the dataset particularly valuable for natural language processing (NLP) tasks that require precise tense handling.

Persons and Numbers

Conjugations are provided for the 1st, 2nd, and 3rd persons in both singular (umoja) and plural (wingi) forms. Each person is conjugated across all five tenses, giving a comprehensive representation of subject-verb agreement in Swahili.

Moods

The dataset incorporates a range of moods, including the habitual mood (hali_ya_mazoea) and various auxiliary and hypothetical forms. These include modal constructs like kum (ability), kuwa (to be), and conditional forms such as ninge, unge, ange.

Dataset Structure

The dataset includes the following columns:

Verb Root (mzizi_wa_neno): The base form from which all conjugated forms are derived.

Conjugated Forms: Columns detailing conjugations for the 1st, 2nd, and 3rd persons in singular and plural forms across all tenses. For example, nafsi_ya_kwanza_umoja_wakati_uliopita specifies the 1st person singular in the past tense.
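The column layout described above can be illustrated with a minimal sketch that reads such a CSV and looks up one conjugated form by root. Only the two column names mzizi_wa_neno and nafsi_ya_kwanza_umoja_wakati_uliopita come from the description. The two sample rows, the helper name load_forms, and the assumption that roots are stored in a form like "soma" are hypothetical and used for illustration only.

```python
import csv
import io

# Hypothetical two-row sample for illustration; the real file has many
# more columns and rows, and may store roots in a different form.
SAMPLE_CSV = """mzizi_wa_neno,nafsi_ya_kwanza_umoja_wakati_uliopita
soma,nilisoma
cheza,nilicheza
"""

def load_forms(handle, root_column="mzizi_wa_neno"):
    """Map each verb root to a dict of its conjugated forms, keyed by column name."""
    table = {}
    for row in csv.DictReader(handle):
        root = row.pop(root_column)  # remaining keys are the conjugation columns
        table[root] = row
    return table

forms = load_forms(io.StringIO(SAMPLE_CSV))

# 1st person singular, past tense, for the root "soma" (to read).
past_1sg = forms["soma"]["nafsi_ya_kwanza_umoja_wakati_uliopita"]
print(past_1sg)
```

To try this against the downloaded file, the in-memory sample could be replaced with an open file handle, for example `open(path, newline="", encoding="utf-8")`, where the path is wherever the CSV was saved.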
Applications

This dataset is a resource for both computational and theoretical linguistic research:

Natural Language Processing: The morphological richness of Swahili verbs makes the dataset particularly suited for NLP tasks, including tokenization, lemmatization, syntactic parsing, and machine translation.

Linguistic Analysis: Researchers can use the dataset to study Swahili’s verb morphology, tense-aspect systems, and comparative analyses with other agglutinative languages.

The dataset’s comprehensive coverage of conjugations, including auxiliary and hypothetical forms, ensures its utility for a wide range of applications, from building robust language models to exploring cross-linguistic phenomena in morphology and syntax.
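As a rough illustration of the lemmatization use case, the sketch below splits a conjugated form into subject prefix, tense marker, and root when the root is already known (for instance from the mzizi_wa_neno column). It is a naive, hypothetical helper and not a method described by the dataset authors. It relies on the standard Swahili subject prefixes (ni, u, a, tu, m, wa) and the tense markers li (past), me (perfect), na (present), and ta (future). The simple present, habitual, and conditional forms are deliberately not handled, and the person labels such as "1sg" are an assumption of this sketch.

```python
# Standard subject prefixes and four common tense markers; the labels
# on the right are naming choices made for this sketch only.
SUBJECT_PREFIXES = {
    "ni": "1sg", "u": "2sg", "a": "3sg",
    "tu": "1pl", "m": "2pl", "wa": "3pl",
}
TENSE_MARKERS = {"li": "past", "me": "perfect", "na": "present", "ta": "future"}

def segment(form, root):
    """Split form into subject prefix + tense marker + root, or return None."""
    if not root or not form.endswith(root):
        return None
    prefix = form[: len(form) - len(root)]
    for subject, person in SUBJECT_PREFIXES.items():
        if not prefix.startswith(subject):
            continue
        marker = prefix[len(subject):]
        tense = TENSE_MARKERS.get(marker)
        if tense is not None:
            return {
                "subject": subject,
                "person": person,
                "tense_marker": marker,
                "tense": tense,
                "root": root,
            }
    return None  # form is not covered by this simplified pattern

print(segment("nilisoma", "soma"))
print(segment("watacheza", "cheza"))
```

Requiring the root as input keeps the sketch simple. Recovering the root from an unseen form would need the dataset itself as a lookup table or a proper morphological analyzer.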
Provider:
Mendeley Data
Created:
2024-10-22