Impact of Parallel vs. Non Parallel Corpora on the Identification of Arabic Dialects

Name: Impact of Parallel vs. Non Parallel Corpora on the Identification of Arabic Dialects
Creator: My University
Published: 2024-05-17 11:45:52
License: No description available

DataCite Commons | Updated 2024-05-17 | Indexed 2024-07-03

Download link:

https://africarxiv.ubuntunet.net/handle/1/523

Resource description:

In this paper, we evaluate the performance of statistical and neural methods for classifying Arabic Dialects (AD). The evaluation is based on two kinds of corpora. The first is PADIC (Parallel Arabic DIalectal Corpus), a multi-dialectal corpus composed of six dialects: two Algerian dialects (from the cities of Algiers and Annaba), Palestinian, Syrian, Tunisian, and Moroccan, in addition to Modern Standard Arabic (MSA). The second, AraDial, is a manually collected corpus that contains the same dialects and the same number of sentences as PADIC (6412 sentences per dialect). In our experiments, we used both statistical and neural classifiers, namely: Gaussian Naive Bayes, Bernoulli Naive Bayes, kNN, Logistic Regression, SGD Classifier, Passive Aggressive Classifier, Perceptron, Linear Support Vector, and Convolutional Neural Network classifiers. We evaluated these classifiers in two setups: training on the parallel corpus (PADIC) and testing on the non-parallel corpus (AraDial), and vice versa. The results show that training our system on the non-parallel corpus gives better results, with a mean score of 92.08%.
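
The two evaluation setups described above (train on one corpus, test on the other, then swap) can be illustrated with a minimal sketch. It is not the authors' implementation: the character-bigram features, the hand-written perceptron, the helper names (char_ngrams, cross_corpus_accuracy), and the placeholder sentences are all assumptions made for illustration, and the abstract does not say how the reported mean score is computed, so the average of the two directions below is only one possible reading.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Count character n-grams of a space-padded sentence (assumed feature set)."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


class Perceptron:
    """Minimal multiclass perceptron over sparse count features."""

    def __init__(self, epochs=10):
        self.epochs = epochs
        self.labels = []
        self.weights = defaultdict(dict)  # label -> {feature: weight}

    def _score(self, feats, label):
        w = self.weights[label]
        return sum(w.get(f, 0.0) * v for f, v in feats.items())

    def predict(self, text):
        feats = char_ngrams(text)
        # Ties resolve to the first label in sorted order.
        return max(self.labels, key=lambda lab: self._score(feats, lab))

    def fit(self, samples):
        self.labels = sorted({label for _, label in samples})
        for _ in range(self.epochs):
            for text, gold in samples:
                guess = self.predict(text)
                if guess != gold:
                    # Standard perceptron update: reward gold, penalize guess.
                    for f, v in char_ngrams(text).items():
                        self.weights[gold][f] = self.weights[gold].get(f, 0.0) + v
                        self.weights[guess][f] = self.weights[guess].get(f, 0.0) - v
        return self


def cross_corpus_accuracy(train, test):
    """Train on one corpus and report accuracy on the other."""
    model = Perceptron().fit(train)
    correct = sum(model.predict(text) == label for text, label in test)
    return correct / len(test)


# Hypothetical placeholder data, NOT taken from PADIC or AraDial.
parallel_like = [("xaxa xoxo", "dialect_a"), ("mimi mumu", "dialect_b")]
collected_like = [("xoxo xaxa xa", "dialect_a"), ("mumu mi", "dialect_b")]

# Setup 1: train on the parallel-style corpus, test on the collected one.
p_to_n = cross_corpus_accuracy(parallel_like, collected_like)
# Setup 2: the reverse direction.
n_to_p = cross_corpus_accuracy(collected_like, parallel_like)

print("parallel -> non-parallel:", p_to_n)
print("non-parallel -> parallel:", n_to_p)
print("mean of both directions (illustrative only):", (p_to_n + n_to_p) / 2)
```

With real data, each list would hold (sentence, dialect) pairs loaded from the corresponding corpus, and any of the classifiers named in the abstract could take the place of the perceptron.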

Provider:

My University

Created:

2024-05-17
