Misraj/Sadeed_Tashkeela
Source: Hugging Face. Updated: 2025-05-20. Indexed: 2025-05-31.
Download link:
https://hf-mirror.com/datasets/Misraj/Sadeed_Tashkeela
Description:
The Sadeed dataset is a large, high-quality corpus of diacritized Arabic text, optimized for training and evaluating Arabic diacritization models. Its training set is drawn exclusively from the Tashkeela corpus, and its test set is a refined version of the Fadel Tashkeela test set. The data has been thoroughly cleaned and normalized: the diacritization style is unified, common errors are corrected, and consonant cluster rules are handled. The text is chunked into segments of 50-60 words to preserve syntactic and contextual coherence. The dataset is suitable for training diacritization models, evaluating diacritization systems, and Arabic NLP tasks that require fully vocalized text.
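To make the intended use concrete, here is a minimal sketch of how fully diacritized text might be turned into (undiacritized input, diacritized target) pairs and cut into word-bounded segments. The helper names `strip_diacritics` and `chunk_words` are hypothetical. The fixed-size split is a simplifying assumption and not the dataset's actual segmentation procedure, which is described only as preserving syntactic and contextual coherence.

```python
import re

# Arabic tashkeel marks: fathatan through sukun (U+064B-U+0652)
# plus the superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Return the text with Arabic diacritic marks removed."""
    return DIACRITICS.sub("", text)


def chunk_words(text, max_words=60):
    """Split text into consecutive segments of at most max_words words.

    Assumption: a naive fixed-size split on whitespace. It ignores
    sentence and clause boundaries, unlike the segmentation described
    for the dataset.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def make_pairs(diacritized_text, max_words=60):
    """Build (input, target) training pairs from diacritized text."""
    return [
        (strip_diacritics(segment), segment)
        for segment in chunk_words(diacritized_text, max_words)
    ]
```

A model trained on such pairs receives the bare text as input and is asked to restore the diacritized target; evaluation then compares its output against the reference segment.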
Provider:
Misraj



