Sadeed_Tashkeela

Name: Sadeed_Tashkeela
Creator: maas
Published: 2025-12-05 16:41:09
License: No description provided

ModelScope community; updated 2025-12-05; indexed 2025-07-12

Download link:

https://modelscope.cn/datasets/Misraj/Sadeed_Tashkeela

Resource description:

# 📚 Sadeed Tashkeela Arabic Diacritization Dataset

The **Sadeed** dataset is a large, high-quality Arabic diacritized corpus optimized for training and evaluating Arabic diacritization models. The training set is built exclusively from the [Tashkeela corpus](https://sourceforge.net/projects/tashkeela/), and the test set is a refined version of the [Fadel Tashkeela test set](https://github.com/AliOsm/arabic-text-diacritization).

## Dataset Overview

- **Training Data**:
  - **Source**: Cleaned version of the Tashkeela corpus (the original data is ~75 million words, mostly Classical Arabic, with ~1.15% Modern Standard Arabic).
  - **Total Examples**: 1,042,698
  - **Total Words**: ~53 million
- **Testing Data**:
  - **Source**: Corrected version of the Fadel Tashkeela test set, addressing inconsistencies in the handling of *iltiqāʾ as-sākinayn* (adjacent consonants without intervening vowels).
  - **Total Examples**: 2,485
  - **Notes**: The test set is refined for phonological consistency according to standard Arabic rules.
- **Features**:
  - Fully normalized Arabic text
  - Minimal missing diacritics
  - Chunked into coherent samples (50–60 words)
  - Designed to preserve syntactic and contextual dependencies

---

## Preprocessing Details

### 1. Text Cleaning

- **Diacritization Corrections**:
  - Unified the diacritization style.
  - Corrected frequent diacritization errors.
  - Resolved inconsistencies in *iltiqāʾ as-sākinayn* (consonant cluster rules) based on standard phonological rules.
- **Normalization**:
  - Applied a comprehensive preprocessing pipeline inspired by [Kuwain](https://github.com/misraj-ai/Kuwain-Arabic-cleaner), preserving non-Arabic characters and symbols.
- **Consistency**:
  - Applied additional normalization steps to ensure stylistic and grammatical consistency across the dataset.

### 2. Text Chunking

- Segmented the text into samples of **50–60 words** each.
- Used a hierarchical strategy that prioritizes natural linguistic breaks, trying stronger punctuation first (sentence-ending punctuation, then commas).
- Designed to preserve the syntactic and contextual coherence of text chunks.

### 3. Dataset Filtering

- **Missing Diacritics**:
  - Excluded samples with more than two undiacritized words.
- **Partial Diacritics**:
  - Removed samples with three or more partially diacritized words.
- **Test Set Overlap**:
  - Eliminated examples that overlap with the Fadel Tashkeela test set, reducing overlap to only **0.4%**.

---

## Usage

This dataset is ideal for:

- **Training**: Using the cleaned Tashkeela corpus to train models for Arabic diacritization.
- **Testing**: Evaluating diacritization systems with the refined Fadel Tashkeela test set.
- Arabic NLP tasks that require fully vocalized texts.

---

## Evaluation Code

The evaluation code for this dataset is available at: https://github.com/misraj-ai/Sadeed

---

## Citation

If you use this dataset, please cite:

```bibtex
@misc{aldallal2025sadeedadvancingarabicdiacritization,
  title={Sadeed: Advancing Arabic Diacritization Through Small Language Model},
  author={Zeina Aldallal and Sara Chrouf and Khalil Hennara and Mohamed Motaism Hamed and Muhammad Hreden and Safwan AlModhayan},
  year={2025},
  eprint={2504.21635},
  archivePrefix={arXiv},
  url={https://huggingface.co/papers/2504.21635},
}
```

---

## License

This dataset is distributed for **research purposes only**. Please review the original [Tashkeela corpus license](https://sourceforge.net/projects/tashkeela/) for terms of use.
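
The chunking and filtering rules described under "Text Chunking" and "Dataset Filtering" can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`chunk_words`, `classify_word`, `keep_sample`), the punctuation sets, and in particular the `PARTIAL_RATIO` threshold used to decide when a word counts as "partially diacritized" are assumptions made for illustration, since the dataset card does not define them.

```python
# Illustrative sketch only; all names and thresholds here are assumptions.

# Harakat, tanween, shadda and sukun (U+064B..U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}
# Basic Arabic letters (U+0621..U+064A), excluding tatweel (U+0640).
LETTERS = {chr(cp) for cp in range(0x0621, 0x064B)} - {"\u0640"}

# Assumed punctuation tiers: sentence-ending marks first, then commas/semicolons.
STRONG = (".", "!", "?", "\u061F")        # includes the Arabic question mark
WEAK = (",", ";", "\u060C", "\u061B")     # includes the Arabic comma and semicolon

# Hypothetical cut-off: a word with some marks, but marks on fewer than
# half of its letters, is treated as "partially diacritized".
PARTIAL_RATIO = 0.5


def chunk_words(text, min_words=50, max_words=60):
    """Split text into chunks of min_words..max_words words, preferring to
    end a chunk on strong punctuation, then weak punctuation, else a hard cut."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_words, len(words))
        if end < len(words):  # the final remainder is kept as-is
            for marks in (STRONG, WEAK):
                # Scan candidate end positions from longest to shortest chunk.
                cut = next(
                    (e for e in range(end, start + min_words - 1, -1)
                     if words[e - 1].endswith(marks)),
                    None,
                )
                if cut is not None:
                    end = cut
                    break
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks


def classify_word(word):
    """Return 'other' (no Arabic letters), 'none', 'partial' or 'full'."""
    letters = marked = 0
    for i, ch in enumerate(word):
        if ch in LETTERS:
            letters += 1
            # A letter counts as marked if a diacritic immediately follows it.
            if i + 1 < len(word) and word[i + 1] in DIACRITICS:
                marked += 1
    if letters == 0:
        return "other"
    if marked == 0:
        return "none"
    return "partial" if marked / letters < PARTIAL_RATIO else "full"


def keep_sample(text):
    """Keep a sample unless it has more than two undiacritized words
    or three or more partially diacritized words."""
    labels = [classify_word(w) for w in text.split()]
    return labels.count("none") <= 2 and labels.count("partial") < 3
```

A typical use would be `[c for c in chunk_words(document) if keep_sample(c)]`. Note that fully diacritized Arabic legitimately leaves some letters unmarked (for example long-vowel letters), so a real pipeline would need a more careful definition of "partial" than the ratio assumed here.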

Provider:

maas

Created:

2025-07-07
