
Sadeed_Tashkeela

ModelScope community. Updated 2025-12-05; indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/Misraj/Sadeed_Tashkeela
Description:
# 📚 Sadeed Tashkeela Arabic Diacritization Dataset

The **Sadeed** dataset is a large, high-quality Arabic diacritized corpus optimized for training and evaluating Arabic diacritization models. It is built exclusively from the [Tashkeela corpus](https://sourceforge.net/projects/tashkeela/) for the training set and a refined version of the [Fadel Tashkeela test set](https://github.com/AliOsm/arabic-text-diacritization) for the test set.

## Dataset Overview

- **Training Data**:
  - **Source**: Cleaned version of the Tashkeela corpus (original data is ~75 million words, mostly Classical Arabic, with ~1.15% Modern Standard Arabic).
  - **Total Examples**: 1,042,698
  - **Total Words**: ~53 million
- **Testing Data**:
  - **Source**: Corrected version of the Fadel Tashkeela test set, addressing inconsistencies in handling *iltiqāʾ as-sākinayn* (adjacent consonants without intervening vowels).
  - **Total Examples**: 2,485 samples
  - **Notes**: The test set is refined for phonological consistency according to standard Arabic rules.
- **Features**:
  - Fully normalized Arabic text
  - Minimal missing diacritics
  - Chunked into coherent samples (50–60 words)
  - Designed to preserve syntactic and contextual dependencies

---

## Preprocessing Details

### 1. Text Cleaning

- **Diacritization Corrections**:
  - Unified diacritization style.
  - Corrected diacritization of frequent errors.
  - Resolved inconsistencies in *iltiqāʾ as-sākinayn* (consonant cluster rules) based on standard phonological rules.
- **Normalization**:
  - Applied a comprehensive preprocessing pipeline inspired by [Kuwain](https://github.com/misraj-ai/Kuwain-Arabic-cleaner), preserving non-Arabic characters and symbols.
- **Consistency**:
  - Additional normalization steps to ensure stylistic and grammatical consistency across the dataset.

### 2. Text Chunking

- Segmented into samples of **50–60 words** each.
- Used a hierarchical strategy prioritizing natural linguistic breaks, focusing on stronger punctuation first (e.g., sentence-ending punctuation, then commas).
- Designed to preserve the syntactic and contextual coherence of text chunks.

### 3. Dataset Filtering

- **Missing Diacritics**:
  - Excluded samples with more than two undiacritized words.
- **Partial Diacritics**:
  - Removed samples with three or more partially diacritized words.
- **Test Set Overlap**:
  - Eliminated overlapping examples with the Fadel Tashkeela test set, reducing overlap to only **0.4%**.

---

## Usage

This dataset is ideal for:

- **Training**: Using the cleaned Tashkeela corpus to train models for Arabic diacritization.
- **Testing**: Evaluating diacritization systems with the refined Fadel Tashkeela test set.
- Arabic NLP tasks that require fully vocalized texts.

---

## Evaluation Code

The evaluation code for this dataset is available at: https://github.com/misraj-ai/Sadeed

---

## Citation

If you use this dataset, please cite:

```bibtex
@misc{aldallal2025sadeedadvancingarabicdiacritization,
  title={Sadeed: Advancing Arabic Diacritization Through Small Language Model},
  author={Zeina Aldallal and Sara Chrouf and Khalil Hennara and Mohamed Motaism Hamed and Muhammad Hreden and Safwan AlModhayan},
  year={2025},
  eprint={2504.21635},
  archivePrefix={arXiv},
  url={https://huggingface.co/papers/2504.21635},
}
```

---

## License

This dataset is distributed for **research purposes only**. Please review the original [Tashkeela corpus license](https://sourceforge.net/projects/tashkeela/) for terms of use.
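
As an illustration of the chunking and missing-diacritics filtering steps described under Preprocessing Details, here is a minimal Python sketch. It is not the authors' pipeline: the function names (`chunk_words`, `count_undiacritized`, `keep_sample`), the punctuation sets, the greedy cut rule, and the definition of an "undiacritized word" (an Arabic word carrying no diacritic mark at all) are all assumptions made for illustration. The partial-diacritics rule is left out because the card does not define what counts as a partially diacritized word.

```python
import re

# Arabic diacritics: fathatan..sukun (U+064B-U+0652) plus dagger alif (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Arabic letters, hamza through yeh (the range also covers tatweel, U+0640).
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")

# Assumed punctuation tiers: strong breaks are tried before weak ones.
STRONG_BREAKS = (".", "!", "?", "؟", "؛")
WEAK_BREAKS = (",", "،")


def chunk_words(words, min_len=50, max_len=60):
    """Greedily cut `words` into chunks of min_len..max_len words.

    Within each window, prefer the last word ending in a strong break,
    then the last word ending in a weak break, else cut at max_len.
    The final chunk may be shorter than min_len.
    """
    chunks, start = [], 0
    while len(words) - start > max_len:
        # Candidate cut positions, scanned from the longest chunk to the shortest.
        window = range(start + max_len - 1, start + min_len - 2, -1)
        cut = None
        for marks in (STRONG_BREAKS, WEAK_BREAKS):
            cut = next((i for i in window if words[i].endswith(marks)), None)
            if cut is not None:
                break
        if cut is None:
            cut = start + max_len - 1  # no punctuation found: hard cut
        chunks.append(" ".join(words[start:cut + 1]))
        start = cut + 1
    if start < len(words):
        chunks.append(" ".join(words[start:]))
    return chunks


def count_undiacritized(sample):
    """Count Arabic words in `sample` that carry no diacritic at all."""
    return sum(
        1
        for word in sample.split()
        if ARABIC_LETTER.search(word) and not DIACRITICS.search(word)
    )


def keep_sample(sample, max_undiacritized=2):
    """Keep a sample unless it has more than `max_undiacritized` bare words."""
    return count_undiacritized(sample) <= max_undiacritized
```

A typical use would be `[c for c in chunk_words(text.split()) if keep_sample(c)]`. Non-Arabic tokens (digits, Latin words, standalone punctuation) are ignored by the filter, in line with the card's note that non-Arabic characters are preserved.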

Provider:
maas
Created:
2025-07-07