
SadeedDiac-25

ModelScope community · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/Misraj/SadeedDiac-25
Description:
# SadeedDiac-25: A Benchmark for Arabic Diacritization

[Paper](https://huggingface.co/papers/2504.21635)

**SadeedDiac-25** is a comprehensive and linguistically diverse benchmark specifically designed for evaluating Arabic diacritization models. It unifies Modern Standard Arabic (MSA) and Classical Arabic (CA) in a single dataset, addressing key limitations in existing benchmarks.

## Overview

Existing Arabic diacritization benchmarks tend to focus on either Classical Arabic (e.g., Fadel, Abbad) or Modern Standard Arabic (e.g., CATT, WikiNews), with limited domain diversity and quality inconsistencies. SadeedDiac-25 addresses these issues by:

- Combining MSA and CA in one dataset
- Covering diverse domains (e.g., news, religion, politics, sports, culinary arts)
- Ensuring high annotation quality through a multi-stage expert review process
- Avoiding contamination from large-scale pretraining corpora

## Dataset Composition

SadeedDiac-25 consists of 1,200 paragraphs:

- **📘 50% Modern Standard Arabic (MSA)**
  - 454 paragraphs of curated original MSA content
  - 146 paragraphs from WikiNews
  - Length: 40–50 words per paragraph
- **📗 50% Classical Arabic (CA)**
  - 📖 600 paragraphs from the Fadel test set

## Evaluation Results

We evaluated several models on SadeedDiac-25, including proprietary LLMs and open-source Arabic models. Evaluation metrics include Diacritic Error Rate (DER), Word Error Rate (WER), and hallucination rates.
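The DER and WER metrics named above can be illustrated with a minimal Python sketch. This is not the official evaluation script; the function names `split_diacritics` and `der_wer`, the choice of U+064B through U+0652 as the diacritic set, and the treatment of each word's final letter as its case ending are all assumptions made for illustration. The sketch also assumes the prediction keeps exactly the same base letters as the reference, so it does not model hallucination rates.

```python
# Assumption: the eight common Arabic diacritics, U+064B (fathatan) to U+0652 (sukun).
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def split_diacritics(word):
    """Return [base_char, marks] pairs; marks are the diacritics that follow the base."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # a diacritic with no preceding base character is ignored
                pairs[-1][1] += ch
        else:
            # Simplification: every non-diacritic character (including
            # punctuation and digits) counts as a base character.
            pairs.append([ch, ""])
    return pairs


def der_wer(reference, prediction, include_case_ending=True):
    """Return (DER, WER) as percentages for two whitespace-tokenized strings."""
    ref_words = reference.split()
    pred_words = prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction have different word counts")

    char_total = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs = split_diacritics(ref_word)
        pred_pairs = split_diacritics(pred_word)
        if [p[0] for p in ref_pairs] != [p[0] for p in pred_pairs]:
            raise ValueError("base letters differ; this sketch does not score that case")
        if not include_case_ending:
            # Assumption: the case ending is the diacritic on the word's last letter.
            ref_pairs, pred_pairs = ref_pairs[:-1], pred_pairs[:-1]
        # Sort marks so that shadda+vowel and vowel+shadda compare equal.
        errors = sum(sorted(r[1]) != sorted(p[1]) for r, p in zip(ref_pairs, pred_pairs))
        char_total += len(ref_pairs)
        char_errors += errors
        word_errors += errors > 0

    der = 100 * char_errors / char_total if char_total else 0.0
    wer = 100 * word_errors / len(ref_words) if ref_words else 0.0
    return der, wer


if __name__ == "__main__":
    # "kataba al-waladu" as reference; the prediction ends in fatha instead of damma.
    ref = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
    pred = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
    print(der_wer(ref, pred))                             # (12.5, 50.0)
    print(der_wer(ref, pred, include_case_ending=False))  # (0.0, 0.0)
```

In this example the only mistake is on the last letter of the second word, so it counts as one error out of eight base letters with case endings and disappears when case endings are excluded. Published DER and WER figures depend on conventions such as whether undiacritized characters and punctuation are counted, so numbers from a sketch like this may not match those in the table.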
The evaluation code for this dataset is available at: https://github.com/misraj-ai/Sadeed

### Evaluation Table

| Model                    | DER (CE)   | WER (CE)   | DER (w/o CE) | WER (w/o CE) | Hallucinations |
| ------------------------ | ---------- | ---------- | ------------ | ------------ | -------------- |
| Claude-3-7-Sonnet-Latest | **1.3941** | **4.6718** | **0.7693**   | **2.3098**   | **0.821**      |
| GPT-4                    | 3.8645     | 5.2719     | 3.8645       | 10.9274      | 1.0242         |
| Gemini-Flash-2.0         | 3.1926     | 7.9942     | 2.3783       | 5.5044       | 1.1713         |
| *Sadeed*                 | *7.2915*   | *13.7425*  | *5.2625*     | *9.9245*     | *7.1946*       |
| Aya-23-8B                | 25.6274    | 47.4908    | 19.7584      | 40.2478      | 5.7793         |
| ALLaM-7B-Instruct        | 50.3586    | 70.3369    | 39.4100      | 67.0920      | 36.5092        |
| Yehia-7B                 | 50.8801    | 70.2323    | 39.7677      | 67.1520      | 43.1113        |
| Jais-13B                 | 78.6820    | 99.7541    | 60.7271      | 99.5702      | 61.0803        |
| Gemma-2-9B               | 78.8560    | 99.7928    | 60.9188      | 99.5895      | 86.8771        |
| SILMA-9B-Instruct-v1.0   | 78.6567    | 99.7367    | 60.7106      | 99.5586      | 93.6515        |

> **Note**: CE = Case Ending

## Citation

If you use SadeedDiac-25 in your work, please cite:

```bibtex
@misc{aldallal2025sadeedadvancingarabicdiacritization,
  title={Sadeed: Advancing Arabic Diacritization Through Small Language Model},
  author={Zeina Aldallal and Sara Chrouf and Khalil Hennara and Mohamed Motaism Hamed and Muhammad Hreden and Safwan AlModhayan},
  year={2025},
  eprint={2504.21635},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.21635},
}
```

## License

📄 This dataset is released under the CC BY-NC-SA 4.0 License.

## Contact

📬 For questions, contact [Misraj-AI](https://misraj.ai/) on Hugging Face.

Provider:
maas
Created:
2025-07-07