
Moroccan Darija Offensive Language Detection Dataset

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/2y4m97b7dc
Description:
The Moroccan Darija dataset was cleaned by removing duplicate entries and discarding sentences with conflicting annotations. To address class imbalance, the majority (non-offensive) class was undersampled. The dataset was then augmented with samples from the OMCD corpus, which went through the same preprocessing pipeline to ensure consistency: emoji representation, normalization, removal of punctuation and diacritics, removal of social media elements, elongation removal, and duplicate removal. Finally, all entries from the Moroccan Darija dataset were relabeled using Claude 3.5 Sonnet to align with the comprehensive OMCD framework, which covers both explicit and implicit forms of offensiveness such as vulgarity, hate speech, hostile intent, contempt, humiliation, and belittlement (OMCD reference). Sentences whose Claude-generated labels conflicted with the previous annotations were flagged for manual review to break the tie.
Created:
2025-09-23
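
The cleaning steps named in the description can be sketched roughly as follows. This is an illustrative approximation, not the dataset authors' pipeline: the regular expressions, the function names (`clean_text`, `dedupe_and_drop_conflicts`, `undersample`), the 0/1 label encoding, and the choice to undersample down to the size of the smallest class are all assumptions, and emoji representation is left out because the description does not say how emojis were encoded.

```python
import random
import re
import string
from collections import defaultdict

# All patterns below are assumptions; the description does not give exact rules.
URL_OR_HANDLE = re.compile(r"https?://\S+|www\.\S+|[@#]\w+")    # social media elements
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")  # tashkeel, dagger alif, tatweel
ELONGATION = re.compile(r"([^\W\d_])\1{2,}")                    # a letter repeated 3+ times
PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "،؛؟" + "]")


def clean_text(text: str) -> str:
    """Strip social media elements, diacritics, elongation, and punctuation."""
    text = URL_OR_HANDLE.sub(" ", text)
    text = ARABIC_DIACRITICS.sub("", text)
    text = ELONGATION.sub(r"\1", text)   # "waaaaw" -> "waw"
    text = PUNCTUATION.sub(" ", text)
    return " ".join(text.split())        # collapse whitespace


def dedupe_and_drop_conflicts(rows):
    """rows: iterable of (text, label). Keeps one row per cleaned text and
    discards any text that appears with more than one label."""
    labels_by_text = defaultdict(set)
    for text, label in rows:
        cleaned = clean_text(text)
        if cleaned:
            labels_by_text[cleaned].add(label)
    return [(t, next(iter(ls))) for t, ls in labels_by_text.items() if len(ls) == 1]


def undersample(rows, seed=0):
    """Randomly shrink every class to the size of the smallest one (assumed target)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    if not by_label:
        return []
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced
```

A typical call would be `undersample(dedupe_and_drop_conflicts(rows))`, where `rows` holds `(sentence, label)` pairs with a hypothetical encoding of 0 for non-offensive and 1 for offensive. The relabeling and manual tie-breaking steps are not covered here.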