Download link:

https://modelscope.cn/datasets/Misraj/SadeedDiac-25


Resource description:

# SadeedDiac-25: A Benchmark for Arabic Diacritization

[Paper](https://huggingface.co/papers/2504.21635)

**SadeedDiac-25** is a comprehensive and linguistically diverse benchmark specifically designed for evaluating Arabic diacritization models. It unifies Modern Standard Arabic (MSA) and Classical Arabic (CA) in a single dataset, addressing key limitations in existing benchmarks.

## Overview

Existing Arabic diacritization benchmarks tend to focus on either Classical Arabic (e.g., Fadel, Abbad) or Modern Standard Arabic (e.g., CATT, WikiNews), with limited domain diversity and quality inconsistencies. SadeedDiac-25 addresses these issues by:

- Combining MSA and CA in one dataset
- Covering diverse domains (e.g., news, religion, politics, sports, culinary arts)
- Ensuring high annotation quality through a multi-stage expert review process
- Avoiding contamination from large-scale pretraining corpora

## Dataset Composition

SadeedDiac-25 consists of 1,200 paragraphs:

- **📘 50% Modern Standard Arabic (MSA)**
  - 454 paragraphs of curated original MSA content
  - 146 paragraphs from WikiNews
  - Length: 40–50 words per paragraph
- **📗 50% Classical Arabic (CA)**
  - 📖 600 paragraphs from the Fadel test set

## Evaluation Results

We evaluated several models on SadeedDiac-25, including proprietary LLMs and open-source Arabic models. Evaluation metrics include Diacritic Error Rate (DER), Word Error Rate (WER), and hallucination rates.

The evaluation code for this dataset is available at: https://github.com/misraj-ai/Sadeed

### Evaluation Table

| Model                    | DER (CE)   | WER (CE)   | DER (w/o CE) | WER (w/o CE) | Hallucinations |
| ------------------------ | ---------- | ---------- | ------------ | ------------ | -------------- |
| Claude-3-7-Sonnet-Latest | **1.3941** | **4.6718** | **0.7693**   | **2.3098**   | **0.821**      |
| GPT-4                    | 3.8645     | 5.2719     | 3.8645       | 10.9274      | 1.0242         |
| Gemini-Flash-2.0         | 3.1926     | 7.9942     | 2.3783       | 5.5044       | 1.1713         |
| *Sadeed*                 | *7.2915*   | *13.7425*  | *5.2625*     | *9.9245*     | *7.1946*       |
| Aya-23-8B                | 25.6274    | 47.4908    | 19.7584      | 40.2478      | 5.7793         |
| ALLaM-7B-Instruct        | 50.3586    | 70.3369    | 39.4100      | 67.0920      | 36.5092        |
| Yehia-7B                 | 50.8801    | 70.2323    | 39.7677      | 67.1520      | 43.1113        |
| Jais-13B                 | 78.6820    | 99.7541    | 60.7271      | 99.5702      | 61.0803        |
| Gemma-2-9B               | 78.8560    | 99.7928    | 60.9188      | 99.5895      | 86.8771        |
| SILMA-9B-Instruct-v1.0   | 78.6567    | 99.7367    | 60.7106      | 99.5586      | 93.6515        |

> **Note**: CE = Case Ending

## Citation

If you use SadeedDiac-25 in your work, please cite:

```bibtex
@misc{aldallal2025sadeedadvancingarabicdiacritization,
  title={Sadeed: Advancing Arabic Diacritization Through Small Language Model},
  author={Zeina Aldallal and Sara Chrouf and Khalil Hennara and Mohamed Motaism Hamed and Muhammad Hreden and Safwan AlModhayan},
  year={2025},
  eprint={2504.21635},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.21635},
}
```

## License

📄 This dataset is released under the CC BY-NC-SA 4.0 License.

## Contact

📬 For questions, contact [Misraj-AI](https://misraj.ai/) on Hugging Face.
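
As an illustration of the DER and WER metrics named in the evaluation results, here is a minimal Python sketch. It is not the official evaluation code from the Sadeed repository, and the function names `split_word` and `der_wer` are hypothetical. It assumes that the reference and the prediction share the same words and base letters, that scores are reported as percentages, and that "without case ending" means skipping the last letter of every word, which is one common convention and may differ from the benchmark's exact definition.

```python
# Arabic diacritic marks U+064B..U+0652: tanween, short vowels, shadda, sukun.
DIACRITICS = frozenset(chr(code) for code in range(0x064B, 0x0653))


def split_word(word):
    """Split a word into [letter, marks] pairs; marks attach to the preceding letter."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:
                pairs[-1][1].add(ch)
        else:
            pairs.append([ch, set()])
    return pairs


def der_wer(reference, prediction, case_ending=True):
    """Return (DER, WER) as percentages for two diacritized strings.

    Assumption: both strings contain the same words and base letters, and
    "without case ending" means the last letter of every word is skipped.
    """
    ref_words, pred_words = reference.split(), prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction differ in word count")

    letters = letter_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs, pred_pairs = split_word(ref_word), split_word(pred_word)
        if [p[0] for p in ref_pairs] != [p[0] for p in pred_pairs]:
            raise ValueError("reference and prediction differ in base letters")
        if not case_ending:
            ref_pairs, pred_pairs = ref_pairs[:-1], pred_pairs[:-1]
        # A letter is wrong when its set of marks differs from the reference.
        wrong = sum(r[1] != p[1] for r, p in zip(ref_pairs, pred_pairs))
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += wrong > 0  # a word is wrong if any letter in it is wrong

    der = 100.0 * letter_errors / letters if letters else 0.0
    wer = 100.0 * word_errors / len(ref_words) if ref_words else 0.0
    return der, wer


# Two-word example written with Unicode escapes; the prediction puts a fatha
# (U+064E) instead of a damma (U+064F) on the final letter of the second word.
REF = "\u0630\u064E\u0647\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
PRED = "\u0630\u064E\u0647\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"

print(der_wer(REF, PRED))                     # (12.5, 50.0)
print(der_wer(REF, PRED, case_ending=False))  # (0.0, 0.0)
```

In the example, the only error sits on a word-final letter, so it counts toward both metrics when case endings are included and disappears when they are excluded. This mirrors why the "w/o CE" columns in the table are generally lower than the "CE" columns.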


Use cases: