Sadeed_Tashkeela
ModelScope community · updated 2025-12-05 · indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/Misraj/Sadeed_Tashkeela
# 📚 Sadeed Tashkeela Arabic Diacritization Dataset
The **Sadeed** dataset is a large, high-quality corpus of diacritized Arabic text, optimized for training and evaluating Arabic diacritization models.
The training set is built exclusively from the [Tashkeela corpus](https://sourceforge.net/projects/tashkeela/); the test set is a refined version of the [Fadel Tashkeela test set](https://github.com/AliOsm/arabic-text-diacritization).
## Dataset Overview
- **Training Data**:
  - **Source**: Cleaned version of the Tashkeela corpus (the original corpus contains ~75 million words, mostly Classical Arabic, with ~1.15% Modern Standard Arabic).
  - **Total Examples**: 1,042,698
  - **Total Words**: ~53 million
- **Testing Data**:
  - **Source**: Corrected version of the Fadel Tashkeela test set, addressing inconsistencies in the handling of *iltiqāʾ as-sākinayn* (adjacent consonants without intervening vowels).
  - **Total Examples**: 2,485
  - **Notes**: The test set is refined for phonological consistency according to standard Arabic rules.
- **Features**:
  - Fully normalized Arabic text
  - Minimal missing diacritics
  - Chunked into coherent samples (50–60 words)
  - Designed to preserve syntactic and contextual dependencies
---
## Preprocessing Details
### 1. Text Cleaning
- **Diacritization Corrections**:
  - Unified the diacritization style.
  - Corrected frequent diacritization errors.
  - Resolved inconsistencies in *iltiqāʾ as-sākinayn* (consonant cluster rules) based on standard phonological rules.
- **Normalization**:
  - Applied a comprehensive preprocessing pipeline inspired by [Kuwain](https://github.com/misraj-ai/Kuwain-Arabic-cleaner), preserving non-Arabic characters and symbols.
- **Consistency**:
  - Applied additional normalization steps to ensure stylistic and grammatical consistency across the dataset.
### 2. Text Chunking
- Segmented into samples of **50–60 words** each.
- Used a hierarchical strategy prioritizing natural linguistic breaks, focusing on stronger punctuation first (e.g., sentence-ending punctuation, then commas).
- Designed to preserve the syntactic and contextual coherence of text chunks.
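The chunking strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `chunk_words`, the `BREAK_LEVELS` punctuation sets, and the fallback to a hard cut when no punctuation is found are all assumptions made for illustration. Only the 50–60 word window and the "stronger punctuation first" ordering come from the description.

```python
# Hypothetical break levels, strongest first. The punctuation sets actually
# used by the dataset's pipeline are not stated in the source.
BREAK_LEVELS = [".!?؟", "؛;:", "،,"]


def chunk_words(text, min_len=50, max_len=60, levels=BREAK_LEVELS):
    """Split text into chunks of min_len..max_len words, preferring to cut
    after the strongest punctuation mark available inside that window."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        # The remaining tail fits in one chunk (it may be shorter than min_len).
        if len(words) - start <= max_len:
            chunks.append(" ".join(words[start:]))
            break
        cut = None
        for marks in levels:
            # Scan candidate end positions from longest to shortest.
            for end in range(start + max_len, start + min_len - 1, -1):
                if words[end - 1][-1] in marks:
                    cut = end
                    break
            if cut is not None:
                break
        if cut is None:
            # Assumption: fall back to a hard cut when no break is found.
            cut = start + max_len
        chunks.append(" ".join(words[start:cut]))
        start = cut
    return chunks
```

In this sketch a sentence-ending mark anywhere in the 50–60 word window wins over a comma, even if the comma would yield a longer chunk. A real pipeline would also need a policy for short trailing chunks, which the source does not describe.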
### 3. Dataset Filtering
- **Missing Diacritics**:
  - Excluded samples with more than two undiacritized words.
- **Partial Diacritics**:
  - Removed samples with three or more partially diacritized words.
- **Test Set Overlap**:
  - Removed training examples that overlap with the Fadel Tashkeela test set, reducing the remaining overlap to only **0.4%**.
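The two diacritic-based filters can be sketched as follows. The thresholds (at most two undiacritized words, at most two partially diacritized words) follow the rules above. Everything else is an assumption: the source does not define "partially diacritized", so the `partial_ratio` heuristic, the Unicode ranges, and the function names are hypothetical.

```python
import re

# Arabic letters (hamza..ghain, feh..yeh) and the common diacritics
# (tanwin, short vowels, shadda, sukun). These ranges are an assumption.
LETTER = re.compile("[\u0621-\u063A\u0641-\u064A]")
MARKED_LETTER = re.compile("[\u0621-\u063A\u0641-\u064A][\u064B-\u0652]+")


def classify_word(word, partial_ratio=0.5):
    """Label a word as 'other', 'undiacritized', 'partial' or 'diacritized'.

    Hypothetical heuristic: a word is 'partial' when fewer than
    partial_ratio of its Arabic letters carry at least one diacritic.
    """
    letters = len(LETTER.findall(word))
    if letters == 0:
        return "other"  # digits, punctuation, non-Arabic tokens
    marked = len(MARKED_LETTER.findall(word))
    if marked == 0:
        return "undiacritized"
    if marked / letters < partial_ratio:
        return "partial"
    return "diacritized"


def keep_sample(text, max_undiacritized=2, max_partial=2):
    """Keep a sample only if it passes both filters described above."""
    labels = [classify_word(w) for w in text.split()]
    return (labels.count("undiacritized") <= max_undiacritized
            and labels.count("partial") <= max_partial)
```

Fully vocalized Arabic does not place a mark on every letter (long-vowel letters, for example, are often left bare), which is why this sketch uses a ratio and not a strict "every letter is marked" test.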
---
## Usage
This dataset is ideal for:
- **Training**: Using the cleaned Tashkeela corpus to train models for Arabic diacritization.
- **Testing**: Evaluating diacritization systems with the refined Fadel Tashkeela test set.
- Arabic NLP tasks that require fully vocalized texts.
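
For training, a common setup is to strip the diacritics from each sample to obtain the model input and keep the original text as the target. The sketch below assumes that setup. The `input`/`target` field names and the Unicode range treated as diacritics are illustrative assumptions, not the dataset's schema.

```python
import re

# Tanwin, short vowels, shadda and sukun (U+064B..U+0652); an assumed range.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving base characters untouched."""
    return DIACRITICS.sub("", text)


def make_pair(diacritized_text):
    """Build a hypothetical (input, target) training pair from one sample."""
    return {
        "input": strip_diacritics(diacritized_text),
        "target": diacritized_text,
    }
```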
---
## Evaluation Code
The evaluation code for this dataset is available at:
https://github.com/misraj-ai/Sadeed
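
The repository above is the authoritative evaluation code. For orientation only, here is a generic sketch of a diacritic error rate, a metric commonly reported for this task: the fraction of Arabic letters whose predicted diacritics differ from the reference. It is not taken from that repository, and conventions such as whether to count unmarked letters or case endings vary between papers. The function names and Unicode ranges are assumptions.

```python
import re

# One base character followed by zero or more diacritics (U+064B..U+0652).
BASE_AND_MARKS = re.compile("([^\u064B-\u0652])([\u064B-\u0652]*)")
ARABIC_LETTER = re.compile("[\u0621-\u063A\u0641-\u064A]")


def split_marks(text):
    """Return (base_char, sorted_marks) pairs; sorting makes the comparison
    insensitive to the order of stacked marks such as shadda + vowel."""
    return [(base, "".join(sorted(marks)))
            for base, marks in BASE_AND_MARKS.findall(text)]


def diacritic_error_rate(reference, prediction):
    """Fraction of Arabic letters whose diacritics differ from the reference."""
    ref = split_marks(reference)
    hyp = split_marks(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and prediction differ in base characters")
    pairs = [(r, h) for (base, r), (_, h) in zip(ref, hyp)
             if ARABIC_LETTER.match(base)]
    if not pairs:
        return 0.0
    return sum(r != h for r, h in pairs) / len(pairs)
```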
---
## Citation
If you use this dataset, please cite:
```bibtex
@misc{aldallal2025sadeedadvancingarabicdiacritization,
title={Sadeed: Advancing Arabic Diacritization Through Small Language Model},
author={Zeina Aldallal and Sara Chrouf and Khalil Hennara and Mohamed Motaism Hamed and Muhammad Hreden and Safwan AlModhayan},
year={2025},
eprint={2504.21635},
archivePrefix={arXiv},
url={https://huggingface.co/papers/2504.21635},
}
```
---
## License
This dataset is distributed for **research purposes only**.
Please review the original [Tashkeela corpus license](https://sourceforge.net/projects/tashkeela/) for terms of use.
Provider: maas
Created: 2025-07-07



