Abdou/arabic-tashkeel-dataset
Hugging Face · Updated 2024-10-28 · Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/Abdou/arabic-tashkeel-dataset
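Description:
The Arabic Tashkeel dataset is a fairly large Arabic text dataset, used mainly to train models that automatically add diacritics (tashkeel) to Arabic text. It has three features: non-vocalized text, vocalized text, and source, and is split into training, validation, and test sets. Its main sources include the Tashkeela dataset, the Shamela library, Wikipedia articles, the APCD and APCDv2 datasets, the Ashaar_diacritized and Ashaar_meter datasets, different riwayat of the Quran, and the Hadith corpora from the University of Leeds and King Saud University. A limitation is that it consists mainly of religious texts in Classical Arabic and may not be well suited to Modern Standard Arabic.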
The Arabic Tashkeel Dataset is a fairly large dataset for training models that automatically add diacritics (tashkeel) to Arabic text. It is gathered from the following main sources: tashkeela, shamela, wikipedia, ashaar, quran-riwayat, and hadith. It is divided into training, validation, and test sets containing 1,463,790, 30,181, and 15,091 samples respectively. Over 90% of the dataset consists of religious texts in Classical Arabic, so models trained on it may be less effective on Modern Standard Arabic.
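The relationship between the vocalized and non-vocalized text can be illustrated with a minimal sketch that strips the diacritic marks from a vocalized string. The character range, the `strip_tashkeel` helper, and the record field names below are assumptions for illustration only; the dataset's own preprocessing and column names may differ.

```python
import re

# Assumed set of marks: fathatan through sukun (U+064B-U+0652)
# plus the superscript alef (U+0670). The dataset may use a different set.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Return the text with the diacritic marks matched by TASHKEEL removed."""
    return TASHKEEL.sub("", text)


# The word "kataba": kaf, ta, ba, each followed by a fatha (U+064E).
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"

# Hypothetical record layout mirroring the three features described above;
# the field names are illustrative, not taken from the dataset.
record = {
    "vocalized": vocalized,
    "non_vocalized": strip_tashkeel(vocalized),
    "source": "example",
}
```

A diacritization model learns the reverse mapping, from the `non_vocalized` string back to the `vocalized` one.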
Provider:
Abdou



