Data_Sheet_1_Hate speech detection with ADHAR: a multi-dialectal hate speech corpus in Arabic.pdf

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://figshare.com/articles/dataset/Data_Sheet_1_Hate_speech_detection_with_ADHAR_a_multi-dialectal_hate_speech_corpus_in_Arabic_pdf/25931464

Description:

Hate speech detection in Arabic poses a complex challenge due to the dialectal diversity across the Arab world. Most existing hate speech datasets for Arabic cover only one dialect or one hate speech category. They also lack balance across dialects, topics, and hate/non-hate classes. In this paper, we address this gap by presenting ADHAR, a comprehensive multi-dialect, multi-category hate speech corpus for Arabic. ADHAR contains 70,369 words and spans five language variants: Modern Standard Arabic (MSA), Egyptian, Levantine, Gulf, and Maghrebi. It covers four key hate speech categories: nationality, religion, ethnicity, and race. A major contribution is that ADHAR is carefully curated to maintain balance across dialects, categories, and hate/non-hate classes to enable unbiased dataset evaluation. We describe the systematic data collection methodology, followed by a rigorous annotation process involving multiple annotators per dialect. Extensive qualitative and quantitative analyses demonstrate the quality and usefulness of ADHAR. Our experiments with various classical and deep learning models demonstrate that our dataset enables the development of robust hate speech classifiers for Arabic, achieving accuracy and F1-scores of up to 90% for hate speech detection and up to 92% for category detection. With AraBERT, we achieved an accuracy and F1-score of 94% for hate speech detection and 95% for category detection.
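The balance property described in the abstract (across dialects, categories, and hate/non-hate classes) can be checked with a few lines of Python. The following is only a minimal sketch under stated assumptions: the field names `dialect`, `category`, and `label`, the helper `balance_report`, and the toy rows are all hypothetical and are not taken from the ADHAR release, whose actual file layout is not described here.

```python
from collections import Counter

def balance_report(rows, keys=("dialect", "category", "label")):
    """For each key, count rows per value and compute the max/min count ratio.

    A ratio of 1.0 means every value of that key occurs equally often.
    The key names are assumptions made for this illustration.
    """
    report = {}
    for key in keys:
        counts = Counter(row[key] for row in rows)
        ratio = max(counts.values()) / min(counts.values())
        report[key] = {"counts": dict(counts), "imbalance_ratio": ratio}
    return report

# Hypothetical toy rows for illustration only; not real ADHAR data.
rows = [
    {"dialect": "MSA", "category": "religion", "label": "hate"},
    {"dialect": "MSA", "category": "race", "label": "non-hate"},
    {"dialect": "Egyptian", "category": "religion", "label": "non-hate"},
    {"dialect": "Egyptian", "category": "race", "label": "hate"},
]

report = balance_report(rows)
for key, info in report.items():
    print(key, info["counts"], info["imbalance_ratio"])
```

A ratio well above 1.0 for any key would indicate that one dialect, category, or class dominates, which is the kind of skew the abstract says the corpus was curated to avoid.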

Created:

2024-05-30
