EmanKhater/Tashkeela
Source: Hugging Face. Updated: 2024-08-04. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/EmanKhater/Tashkeela
Description:
A version of the Tashkeela Arabic diacritized text dataset with non-Arabic content and undiacritized text removed, divided into training, development (validation), and test sets. The cleaning process removes XML tags and stray symbols, fixes diacritization errors, and unifies contradictory conventions. Tokenization extracts the Arabic words, producing a file of space-separated tokens, and sentences are segmented at common punctuation marks. The dataset contains over 3 million fully diacritized sentences of varying length, mostly Classical Arabic. The data is split into 90% training, 5% validation, and 5% test.
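As a rough illustration of how such data is typically used, the sketch below strips diacritics from a sentence (to build undiacritized input for a diacritization model) and performs a 90/5/5 split. The function names `strip_diacritics` and `split_90_5_5` are hypothetical, and the shuffled split is an assumption: the description does not say how the dataset's own split was produced.

```python
import random
import re

# Arabic diacritic marks (tashkeel): U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return the text with Arabic diacritic marks removed."""
    return DIACRITICS.sub("", text)


def split_90_5_5(sentences, seed=0):
    """Shuffle and split into 90% train, 5% validation, 5% test.

    Assumption: a seeded random shuffle; the dataset's actual split
    procedure is not documented in the description above.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 90 // 100
    n_val = len(items) * 5 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Pairing each diacritized sentence with its stripped form gives (input, target) examples for training; integer arithmetic is used for the split sizes to avoid floating-point rounding.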
Provided by:
EmanKhater



