SadeedDiac-25
Source: ModelScope community (updated 2025-12-05, indexed 2025-07-12)
Download link:
https://modelscope.cn/datasets/Misraj/SadeedDiac-25
# SadeedDiac-25: A Benchmark for Arabic Diacritization
[Paper](https://huggingface.co/papers/2504.21635)
**SadeedDiac-25** is a comprehensive and linguistically diverse benchmark specifically designed for evaluating Arabic diacritization models. It unifies Modern Standard Arabic (MSA) and Classical Arabic (CA) in a single dataset, addressing key limitations in existing benchmarks.
## Overview
Existing Arabic diacritization benchmarks tend to focus on either Classical Arabic (e.g., Fadel, Abbad) or Modern Standard Arabic (e.g., CATT, WikiNews), with limited domain diversity and quality inconsistencies. SadeedDiac-25 addresses these issues by:
- Combining MSA and CA in one dataset
- Covering diverse domains (e.g., news, religion, politics, sports, culinary arts)
- Ensuring high annotation quality through a multi-stage expert review process
- Avoiding contamination from large-scale pretraining corpora
## Dataset Composition
SadeedDiac-25 consists of 1,200 paragraphs:
- **📘 50% Modern Standard Arabic (MSA)**
  - 454 paragraphs of curated original MSA content
  - 146 paragraphs from WikiNews
  - Length: 40–50 words per paragraph
- **📗 50% Classical Arabic (CA)**
  - 📖 600 paragraphs from the Fadel test set
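As a quick sanity check on the composition above, the following sketch adds up the paragraph counts and confirms the 50/50 split. The dictionary keys are illustrative labels chosen here, not field names from the dataset.

```python
# Paragraph counts as stated in the composition list.
# The keys are illustrative labels, not dataset field names.
composition = {
    "MSA (curated original)": 454,
    "MSA (WikiNews)": 146,
    "CA (Fadel test set)": 600,
}

total = sum(composition.values())
msa = composition["MSA (curated original)"] + composition["MSA (WikiNews)"]
ca = composition["CA (Fadel test set)"]

print(f"total paragraphs: {total}")
print(f"MSA share: {msa / total:.0%}, CA share: {ca / total:.0%}")
```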
## Evaluation Results
We evaluated several models on SadeedDiac-25, including proprietary LLMs and open-source Arabic models. Evaluation metrics include Diacritic Error Rate (DER), Word Error Rate (WER), and hallucination rates.
The evaluation code for this dataset is available at: https://github.com/misraj-ai/Sadeed
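To make the metrics concrete, here is a minimal sketch of how DER and WER can be computed with and without case endings. It is not the evaluation code from the repository above. It assumes that the reference and prediction contain the same base letters, that only the marks U+064B to U+0652 count as diacritics, and that the case ending is the set of marks on a word's final letter. The function names `split_word` and `der_wer` are hypothetical, and published DER definitions differ in which characters they count.

```python
# Arabic diacritic marks U+064B..U+0652: tanween, short vowels, shadda, sukun.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_word(word):
    """Split a word into [letter, marks] pairs; each mark attaches to the preceding letter."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # ignore a stray mark that has no base letter
                pairs[-1][1] += ch
        else:
            pairs.append([ch, ""])
    return pairs


def der_wer(reference, prediction, include_case_ending=True):
    """Return (DER, WER) in percent for two diacritized strings with identical letters."""
    ref_words, pred_words = reference.split(), prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction must have the same number of words")

    letters = letter_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs, pred_pairs = split_word(ref_word), split_word(pred_word)
        if [p[0] for p in ref_pairs] != [p[0] for p in pred_pairs]:
            raise ValueError("base letters differ; this sketch only compares diacritics")
        if not include_case_ending:
            # Simplification: treat the marks on the final letter as the case ending.
            ref_pairs, pred_pairs = ref_pairs[:-1], pred_pairs[:-1]
        # Compare marks as sorted lists so the order of shadda and vowel does not matter.
        errors = sum(sorted(r[1]) != sorted(p[1]) for r, p in zip(ref_pairs, pred_pairs))
        letters += len(ref_pairs)
        letter_errors += errors
        word_errors += errors > 0

    der = 100.0 * letter_errors / letters if letters else 0.0
    wer = 100.0 * word_errors / len(ref_words) if ref_words else 0.0
    return der, wer


# "كَتَبَ الوَلَدُ" written with escapes so the code points are unambiguous.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
# Same sentence with a fatha instead of a damma on the final letter (wrong case ending).
prediction = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"

print(der_wer(reference, prediction))                             # (12.5, 50.0)
print(der_wer(reference, prediction, include_case_ending=False))  # (0.0, 0.0)
```

In this example the only mistake is on the final letter of the second word, so it counts as one error out of eight letters and one wrong word out of two when case endings are included, and disappears entirely when they are excluded.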
### Evaluation Table
| Model | DER (CE) | WER (CE) | DER (w/o CE) | WER (w/o CE) | Hallucinations |
| ------------------------ | ---------- | ---------- | ------------ | ------------ | -------------- |
| Claude-3-7-Sonnet-Latest | **1.3941** | **4.6718** | **0.7693** | **2.3098** | **0.821** |
| GPT-4 | 3.8645 | 5.2719 | 3.8645 | 10.9274 | 1.0242 |
| Gemini-Flash-2.0 | 3.1926 | 7.9942 | 2.3783 | 5.5044 | 1.1713 |
| *Sadeed* | *7.2915* | *13.7425* | *5.2625* | *9.9245* | *7.1946* |
| Aya-23-8B | 25.6274 | 47.4908 | 19.7584 | 40.2478 | 5.7793 |
| ALLaM-7B-Instruct | 50.3586 | 70.3369 | 39.4100 | 67.0920 | 36.5092 |
| Yehia-7B | 50.8801 | 70.2323 | 39.7677 | 67.1520 | 43.1113 |
| Jais-13B | 78.6820 | 99.7541 | 60.7271 | 99.5702 | 61.0803 |
| Gemma-2-9B | 78.8560 | 99.7928 | 60.9188 | 99.5895 | 86.8771 |
| SILMA-9B-Instruct-v1.0 | 78.6567 | 99.7367 | 60.7106 | 99.5586 | 93.6515 |
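> **Note**: CE = case ending, the diacritic on a word's final letter that marks grammatical case or mood. The "w/o CE" columns exclude it. For all error rates, lower is better.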
## Citation
If you use SadeedDiac-25 in your work, please cite:
```bibtex
@misc{aldallal2025sadeedadvancingarabicdiacritization,
title={Sadeed: Advancing Arabic Diacritization Through Small Language Model},
author={Zeina Aldallal and Sara Chrouf and Khalil Hennara and Mohamed Motaism Hamed and Muhammad Hreden and Safwan AlModhayan},
year={2025},
eprint={2504.21635},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.21635},
}
```
## License
📄 This dataset is released under the CC BY-NC-SA 4.0 License.
## Contact
📬 For questions, contact [Misraj-AI](https://misraj.ai/) on Hugging Face.
Provider: maas
Created: 2025-07-07



