NArabizi Treebank
arXiv | Updated 2021-05-31 | Indexed 2024-06-21
Download link:
https://parsiti.github.io/NArabizi/
Description:
The NArabizi Treebank is an Algerian dialect dataset of approximately 1,500 sentences, created by researchers in the Department of Informatics at the University of Oslo. The sentences are drawn mainly from the online forums of Algerian newspapers and from song lyrics, and they cover several scripts: Latin, Arabic, and mixed. The dataset carries multiple annotation layers, including part-of-speech tagging, sentiment analysis, and topic classification. It is intended to support natural language processing for Algerian dialect, and in particular to examine how language similarity and script differences affect performance in cross-lingual transfer tasks.
Provider:
Department of Informatics, University of Oslo
Created:
2021-05-16



