Sadeed-Tashkeela
Source: arXiv. Indexed 2025-09-30.
Download link:
https://huggingface.co/datasets/Misraj/Sadeed_Tashkeela
Description:
This dataset is derived from the Tashkeela corpus and the Arabic Treebank (ATB-3) and is intended for Arabic diacritization. It contains 1,042,698 examples totaling approximately 53 million words. The texts are normalized and then split into chunks in a way that preserves syntactic and contextual dependencies. To ensure data quality, preprocessing corrects diacritization errors and normalizes the text. The dataset is also optimized for training and constructed to avoid data leakage from existing test sets.
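For readers unfamiliar with the task, the following is a minimal Python sketch of how a diacritized text can be turned into an (input, target) pair by stripping the diacritic marks. It is illustrative only and is not the dataset authors' preprocessing pipeline. The helper names `strip_diacritics`, `make_pair`, and `load_sadeed` are hypothetical, the choice of marks to strip (U+064B to U+0652 plus U+0670) is an assumption, and the dataset's split and column names are not stated above, so the loader returns whatever the repository exposes.

```python
import re

# Assumption: treat fathatan..sukun (U+064B-U+0652) and the superscript
# alef (U+0670) as the diacritics to strip. The dataset's own
# normalization rules may differ.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the Arabic diacritic marks above removed."""
    return DIACRITICS.sub("", text)


def make_pair(diacritized: str) -> tuple[str, str]:
    """Build an (undiacritized input, diacritized target) pair."""
    return strip_diacritics(diacritized), diacritized


def load_sadeed():
    """Hypothetical helper: fetch the dataset from the Hugging Face Hub.

    Requires the `datasets` package and network access. Split and column
    names are not given in the description, so inspect the returned
    object before relying on any particular field.
    """
    from datasets import load_dataset

    return load_dataset("Misraj/Sadeed_Tashkeela")
```

Calling `make_pair` on a diacritized string yields the bare text a model would receive alongside the fully marked text it should reproduce.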



