
Normalized Arabic Fragments for Inestimable Stemming (NAFIS)

Source: catalogue.elra.info, indexed 2025-03-26
Download link:
https://catalogue.elra.info/en-us/repository/browse/ELRA-W0127/
Resource description:
Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed of a collection of sentences, selected to be representative of Arabic stemming tasks and manually annotated. NAFIS is:

Comprehensive: The content of NAFIS can be generalized to the Arabic language as a whole. For stemming, a comprehensive corpus must contain all possible affix combinations. To that end, linguists made an inventory of all Arabic affix combinations. An affix is a prefix-suffix couple that can be agglutinated to a specific word type (noun, verb or particle). Arabic affixes consist of 12 atomic prefixes and 11 atomic suffixes, which combine into about 94 prefixes and 73 suffixes. (The terms affix, prefix and suffix are used here instead of clitic, proclitic and enclitic because they are widely used in the literature.) For example, the prefix “وَال” (and the) is composed of two atomic prefixes: “وَ” (the conjunction “and”) and “ال” (the definite article “the”).

Compiled: Linguists gathered a set of sentences containing all the affixes listed in that inventory, to satisfy the comprehensiveness criterion. The compiled sentences come from various sources (poems, the Holy Quran, books and periodicals) and are of diverse kinds (proverbs and dicta, article commentary, religious text, literature, historical fiction). For instance, the sentence "عليكم بالجد فإنه أساس النجاح" is part of the corpus and contains four affix combinations:
1. [-كم]: the empty prefix associated with the suffix pronoun ‘you’
2. [بال-]: composed of two atomic prefixes ("ب", the preposition 'with', and “ال”, the definite article 'the') and the empty suffix
3. [ه-ف]: composed of the prefix “ف” (the conjunction “then”) and the suffix “ه” (the pronoun “his”)
4. [ال-]: composed of “ال”, the definite article 'the', and the empty suffix

NAFIS is represented according to the TEI standard.
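The prefix-suffix couples described above can be made concrete with a small sketch. It is not the procedure used to build NAFIS: the inventory below is a hypothetical, deliberately tiny subset (only the affixes that appear in the example sentence, written without diacritics), and the names COMPOUND_PREFIXES, PREFIXES, SUFFIXES and candidate_segmentations are invented for illustration. The function simply enumerates every (prefix, base, suffix) split that the inventory allows, which is one way to see why a single word can have several competing stemming solutions.

```python
# Hypothetical mini-inventory: NOT the full NAFIS list of about 94 prefixes
# and 73 suffixes, only the affixes mentioned in the example sentence.
COMPOUND_PREFIXES = {
    "وال": ("و", "ال"),  # "and" + "the"
    "بال": ("ب", "ال"),  # "with" + "the"
}
PREFIXES = ["", "ال", "ب", "ف", "و"] + list(COMPOUND_PREFIXES)
SUFFIXES = ["", "كم", "ه"]


def candidate_segmentations(word, min_base=2):
    """Return every (prefix, base, suffix) split allowed by the inventory.

    The empty string stands for the empty prefix or the empty suffix.
    min_base is an assumed lower bound on the base length, used only to
    discard degenerate splits.
    """
    candidates = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            base = rest[: len(rest) - len(suffix)]
            if len(base) >= min_base:
                candidates.append((prefix, base, suffix))
    return candidates


if __name__ == "__main__":
    for word in "عليكم بالجد فإنه أساس النجاح".split():
        print(word, candidate_segmentations(word))
```

The sketch over-generates on purpose: choosing the contextually correct split among the candidates is exactly what the manual annotation in the corpus provides.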
Sentences are enclosed within the <phr> tag. A sentence is a set of segments representing words (<w>). Since a word can have several stemming solutions (<choice>), each alternative is included within a <form> tag, which contains the prefix, base (root and stem) and suffix morphemes. All alternatives are ordered randomly except the first one, which is the suitable solution given the sentence context.

The corpus has the following characteristics:
• 37 sentences
• The average sentence length is 5.05 words, with the longest being 10 words
• Declarative, interrogative, imperative and exclamatory sentences account for 37.84%, 32.43%, 16.22% and 13.51% respectively
• 154 tokens, with an average of 5.95 stemming solutions per token
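The encoding described above could be read with a few lines of Python. This is a sketch under stated assumptions, not the documented NAFIS schema: the description names only the <phr>, <w>, <choice> and <form> tags, so the <m type="..."> morpheme elements, the sample content and the function names read_solutions and gold_and_average are all hypothetical. The one property taken from the description is that the first <form> of each word is the contextually correct solution.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <m type="..."> children are an assumption,
# since the description does not say how morphemes are tagged inside <form>.
SAMPLE = """
<phr>
  <w>
    <choice>
      <form><m type="prefix">بال</m><m type="base">جد</m><m type="suffix"></m></form>
      <form><m type="prefix">ب</m><m type="base">الجد</m><m type="suffix"></m></form>
    </choice>
  </w>
  <w>
    <choice>
      <form><m type="prefix">ال</m><m type="base">نجاح</m><m type="suffix"></m></form>
    </choice>
  </w>
</phr>
"""


def read_solutions(xml_text):
    """Return, for each <w>, its list of (prefix, base, suffix) alternatives."""
    root = ET.fromstring(xml_text)
    sentence = []
    for w in root.iter("w"):
        forms = []
        for form in w.iter("form"):
            # Empty elements have text None, so fall back to "".
            parts = {m.get("type"): (m.text or "") for m in form.iter("m")}
            forms.append(
                (parts.get("prefix", ""), parts.get("base", ""), parts.get("suffix", ""))
            )
        sentence.append(forms)
    return sentence


def gold_and_average(sentence):
    """Take the first alternative of each word as the gold solution and
    compute the average number of alternatives per word."""
    gold = [forms[0] for forms in sentence if forms]
    average = sum(len(forms) for forms in sentence) / len(sentence)
    return gold, average


if __name__ == "__main__":
    gold, average = gold_and_average(read_solutions(SAMPLE))
    print(gold, average)
```

Run over the whole corpus, the same average would correspond to the reported figure of 5.95 solutions per token, provided the real files follow a layout close to the one assumed here.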

Provider:
catalogue.elra.info