
CTAB: Corpus of Tunisian Arabizi

Source: NIAID Data Ecosystem, indexed 2026-03-12
Download link:
https://zenodo.org/record/4749724
Description:
This dataset was created between 2017 and 2021 to provide a textual resource for studying how Tunisian people write Tunisian Arabic (ISO 639-3: aeb) in Latin script. The corpus consists of messages written in the Tunisian Arabic Chat Alphabet, or Arabizi, and was developed to address the lack of NLP resources on the use of Latin script for transcribing Tunisian Arabic. The messages were collected automatically by web scraping public Facebook pages and are kept as they are, without annotation, spelling adjustment, or morphological or syntactic labeling. Messages written in Latin script but not in Tunisian Arabic were then removed manually. Finally, all messages retrieved from the same Facebook page in the same period are stored in the same text file, with one message per line.
Created:
2021-05-23
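
The one-message-per-line layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset's own tooling: the directory name `ctab`, the `.txt` extension, and the UTF-8 encoding are assumptions, and `iter_messages` is a hypothetical helper name. Adjust these to match the files actually downloaded from Zenodo.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_messages(corpus_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (file_name, message) pairs, one per non-empty line.

    Assumes every corpus file is a UTF-8 text file ending in .txt
    (an assumption; the dataset description does not state this).
    """
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # errors="replace" keeps reading even if a scraped message
        # contains bytes that are not valid UTF-8.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                message = line.rstrip("\r\n")
                if message.strip():  # skip blank lines
                    yield path.name, message


if __name__ == "__main__":
    # "ctab" is a hypothetical local folder holding the extracted files.
    total = sum(1 for _ in iter_messages("ctab"))
    print(f"{total} messages found")
```

Because the messages are unannotated and unnormalized, the sketch returns each line verbatim apart from the trailing newline; any tokenization or spelling normalization would be a separate, later step.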