Swahili Verb Conjugation Dataset: A Comprehensive Analysis of Agglutination and Verb Structure Across Tenses and Persons

Source: doi.org (indexed 2025-03-22)
Download link:
http://doi.org/10.17632/rvt89578g5.1
Description:
This Swahili Verb Conjugation Dataset contains over 319,156 conjugated verb forms, compiled to capture the complexity of Swahili’s agglutinative verb morphology. Swahili has a rich inflectional system in which verbs take prefixes and suffixes that encode grammatical information such as tense, aspect, person, number, and mood. The dataset consists of a single CSV file in which each row represents a unique verb root (mzizi_wa_neno) and its conjugated forms across several dimensions:

Tenses: The dataset covers five core tenses: past, perfect, present, future, and simple present. These tenses are central to Swahili verb conjugation and differ significantly from English tense structures, which makes the dataset useful for handling tense markers in NLP tasks.

Persons and Numbers: Conjugations are provided for the 1st, 2nd, and 3rd persons, in both singular (umoja) and plural (wingi). Each person is conjugated across the five tenses, showing the morphological changes that depend on the subject.

Moods: The dataset includes the habitual mood (hali_ya_mazoea), as well as other modal forms and auxiliary verbs that are part of Swahili’s verb system, such as kum (ability), kuwa (to be), and various hypothetical forms (e.g., ninge, unge, ange for the conditional).

The columns in the dataset include:

Verb Root (mzizi_wa_neno): The base form of the verb from which all conjugated forms are derived.

Conjugated Forms: These columns hold the verb conjugations for the 1st, 2nd, and 3rd persons, singular and plural, across all tenses. For example, nafsi_ya_kwanza_umoja_wakati_uliopita refers to the 1st person singular in the past tense.

In addition to standard conjugations, the dataset covers various auxiliary and hypothetical forms.
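The column naming scheme described above can be decoded mechanically. The following is a minimal sketch, assuming that every conjugation column follows the same pattern as the one confirmed example (nafsi_ya_kwanza_umoja_wakati_uliopita). The labels for the other persons and tenses are standard Swahili grammar terms assumed here, not names verified against the CSV, and the sample row is made up for illustration.

```python
import csv
import io

# Only "kwanza", "umoja", "wingi" and "uliopita" are confirmed by the
# description. The remaining labels are ASSUMED from standard Swahili
# grammar terminology and may not match the real column headers.
PERSONS = {"kwanza": 1, "pili": 2, "tatu": 3}
NUMBERS = {"umoja": "singular", "wingi": "plural"}
TENSES = {"uliopita": "past", "uliopo": "present", "ujao": "future"}


def parse_column(name):
    """Decode 'nafsi_ya_<person>_<number>_wakati_<tense>' into tags.

    Returns None for columns that do not match the assumed pattern,
    such as mzizi_wa_neno or hali_ya_mazoea.
    """
    parts = name.split("_")
    if len(parts) != 6:
        return None
    if parts[0] != "nafsi" or parts[1] != "ya" or parts[4] != "wakati":
        return None
    person = PERSONS.get(parts[2])
    number = NUMBERS.get(parts[3])
    tense = TENSES.get(parts[5])
    if person is None or number is None or tense is None:
        return None
    return {"person": person, "number": number, "tense": tense}


def to_long_rows(csv_text):
    """Turn the wide one-row-per-root layout into one record per form."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        root = row["mzizi_wa_neno"]
        for column, form in row.items():
            tags = parse_column(column)
            if tags is not None:
                records.append({"root": root, "form": form, **tags})
    return records


# Hypothetical two-column sample; the real file has many more columns.
SAMPLE_CSV = (
    "mzizi_wa_neno,nafsi_ya_kwanza_umoja_wakati_uliopita\n"
    "soma,nilisoma\n"
)

if __name__ == "__main__":
    for record in to_long_rows(SAMPLE_CSV):
        print(record)
```

Columns that do not fit the pattern are skipped rather than guessed at, so the mapping tables can be extended once the actual header row has been inspected.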
This breadth of forms makes the dataset useful for researchers working on Swahili Natural Language Processing (NLP), since it supplies the morphological coverage needed for tasks such as tokenization, lemmatization, and syntactic parsing. The dataset also supports linguistic research beyond computational applications, including the study of Swahili verb morphology, tense-aspect systems, and cross-linguistic comparison with other agglutinative languages.
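One way such a table could serve lemmatization is as a reverse lookup from conjugated form to verb root. The sketch below assumes rows have already been read into dictionaries keyed by column name (for example with csv.DictReader). The helper names build_lemma_index and lemmatize are hypothetical, and the two sample rows are illustrative rather than taken from the dataset.

```python
from collections import defaultdict


def build_lemma_index(rows, root_column="mzizi_wa_neno"):
    """Map every conjugated form to the set of roots it can come from."""
    index = defaultdict(set)
    for row in rows:
        root = row[root_column]
        for column, form in row.items():
            # Skip the root column itself and any empty cells.
            if column != root_column and form:
                index[form.lower()].add(root)
    return index


def lemmatize(token, index):
    """Return candidate roots for a token, or an empty list if unseen."""
    return sorted(index.get(token.lower(), ()))


# Illustrative rows only; real rows would carry one cell per column.
SAMPLE_ROWS = [
    {"mzizi_wa_neno": "soma",
     "nafsi_ya_kwanza_umoja_wakati_uliopita": "nilisoma"},
    {"mzizi_wa_neno": "andika",
     "nafsi_ya_kwanza_umoja_wakati_uliopita": "niliandika"},
]

if __name__ == "__main__":
    lemma_index = build_lemma_index(SAMPLE_ROWS)
    print(lemmatize("Nilisoma", lemma_index))
    print(lemmatize("haijulikani", lemma_index))
```

A set is used per form because the same surface form can, in principle, appear under more than one root. A lookup like this only covers forms listed in the table, so unseen words return an empty list.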

Provided by:
Mendeley Data