Moroccan Darija Offensive Language Detection Dataset
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/2y4m97b7dc
Description:
The Moroccan Darija dataset was cleaned by removing duplicate entries and discarding sentences with conflicting annotations. To address class imbalance, undersampling was applied to reduce the size of the majority (non-offensive) class.
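The cleaning and balancing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `clean_and_balance`, the `(text, label)` row layout, the label value 0 for the non-offensive class, and the 1:1 target ratio are all assumptions, since the description does not state how far the majority class was reduced.

```python
import random
from collections import defaultdict


def clean_and_balance(rows, majority_label=0, seed=0):
    """Deduplicate, drop conflicting sentences, and undersample the majority class.

    rows: iterable of (text, label) pairs (hypothetical layout).
    """
    # Collect every label seen for each distinct sentence.
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text].add(label)

    # Keep one copy per sentence; discard sentences annotated with more than one label.
    unique = [
        (text, next(iter(labels)))
        for text, labels in labels_by_text.items()
        if len(labels) == 1
    ]

    majority = [row for row in unique if row[1] == majority_label]
    minority = [row for row in unique if row[1] != majority_label]

    # Assumption: undersample the majority class down to the minority size (1:1).
    rng = random.Random(seed)
    if len(majority) > len(minority):
        majority = rng.sample(majority, len(minority))

    return majority + minority
```

Fixing the seed keeps the undersampled subset reproducible across runs.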
The dataset was also augmented with samples from the OMCD corpus, which underwent the same preprocessing pipeline to ensure consistency, including emoji representation, normalization, removal of punctuation and diacritics, elimination of social media elements, elongation removal, and duplicate removal.
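A text-normalization pass along these lines might look like the sketch below. It uses only the Python standard library, and the details are assumptions rather than the published pipeline: the regular expressions, the choice to drop whole hashtags and mentions, and the rule of collapsing runs of three or more identical characters to one are all illustrative. The emoji representation step is not shown because the description does not say how emojis were encoded.

```python
import re
import unicodedata

# Hypothetical patterns for social media elements.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
# A character repeated three or more times in a row counts as an elongation.
ELONGATION_RE = re.compile(r"(.)\1{2,}")


def preprocess(text):
    """Apply an illustrative subset of the preprocessing steps to one sentence."""
    # Remove URLs, then mentions and hashtags.
    text = URL_RE.sub(" ", text)
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)
    # Drop combining marks (Unicode category Mn), which includes Arabic diacritics.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Replace punctuation (Unicode categories starting with "P") with spaces.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse elongated characters to a single occurrence.
    text = ELONGATION_RE.sub(r"\1", text)
    # Normalize whitespace.
    return " ".join(text.split())
```

Running the same function over both the Moroccan Darija sentences and the OMCD samples, followed by a deduplication pass, is one way to keep the two sources consistent.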
Finally, all entries from the Moroccan Darija dataset were relabeled using Claude 3.5 Sonnet to align with the comprehensive OMCD framework, covering both explicit and implicit forms of offensiveness such as vulgarity, hate speech, hostile intent, contempt, humiliation, and belittlement (OMCD reference).
Sentences whose Claude-generated labels conflicted with the previous annotations were flagged for manual review to break the tie.
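The disagreement check can be pictured as a simple split of the data into agreed and to-be-reviewed sentences. The sketch below is illustrative only; the record keys `original_label` and `model_label` and the function name `flag_disagreements` are hypothetical and do not come from the dataset.

```python
def flag_disagreements(records):
    """Split records into those where both labels agree and those needing review.

    records: list of dicts with 'text', 'original_label', and 'model_label'
    keys (hypothetical field names).
    """
    agreed, to_review = [], []
    for record in records:
        if record["original_label"] == record["model_label"]:
            agreed.append(record)
        else:
            # Conflicting labels: a human annotator decides the final label.
            to_review.append(record)
    return agreed, to_review
```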
Created:
2025-09-23



