
Kiswahili Hate Speech Dataset

Indexed from Mendeley Data on 2026-04-18
Download link:
https://data.mendeley.com/datasets/rwhcyz3ndn
Description:
Kiswahili Hate Speech Dataset (KHS-2026)

Overview
KHS-2026 is a monolingual Kiswahili dataset created for hate speech detection in low-resource African languages. It contains 6,451 annotated social media posts and comments collected from public sources in East Africa, primarily Kenya.

Size & Classes
Total: 6,451 items
Neutral: ~65% (≈4,285)
Offensive: ~25% (≈1,565)
Hate Speech: ~10% (≈601)
The severe imbalance reflects real-world social media, where most content is neutral or mildly critical, while explicit hate is rare but harmful.

Sources
Public posts and comments from Twitter (now X), Facebook, and YouTube.
Focus: sensitive topics including politics, protests, gender, ethnicity, and religion (mostly Kenyan discourse).

Collection & Preprocessing
Raw collection: ~6,539 items (Twitter API, Facebook CrowdTangle, YouTube API + manual)
Preprocessing (conservative cleaning):
  Removed duplicates
  Normalized whitespace and UTF-8
  Filtered broken URLs, excess punctuation, and irrelevant metadata
  Kept emojis, hashtags, and mentions for context
  Minimal spelling fixes; no stemming or lemmatization
Final size after cleaning: 6,451 items

Annotation
Annotators: 3 native/fluent Kiswahili speakers (linguistics/comms background)
Training: 2-day session + pilot on 500 samples
Scheme: 3 primary labels (Neutral / Offensive / Hate Speech)
Inter-annotator agreement: Randolph's free-marginal kappa = 0.72 (fair/good)
Disagreements: resolved via discussion + majority vote

Validation
Baseline models (TF-IDF + 75/25 split):
  Overall accuracy: ~69%
  Logistic Regression: Neutral F1 ≈ 0.81; Offensive/Hate F1 ≈ 0.41 each
  SVM: strong precision but very low recall on minority classes (F1 0.13–0.22)
Poor minority-class performance is attributed to imbalance plus subtle expressions (sarcasm, metaphors, cultural nuance).

Purpose & Release
Provides a culturally grounded resource for Kiswahili NLP, complements AfriHate, and supports content moderation in East/Central Africa.
Planned release: public under CC BY-NC 4.0 with a datasheet (following Gebru et al., 2018) detailing sources, biases, and limitations (dialect variation, code-mixing risks, subtle hostility).

Key insight
The dataset shows reliable neutral annotation but highlights major challenges in detecting nuanced hate/offensive content in African languages, a call for better imbalance handling and culturally informed models.
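
As an illustration of two quantities mentioned in the description, the sketch below computes Randolph's free-marginal kappa for a small set of made-up ratings and derives inverse-frequency class weights from the approximate class counts quoted above. It is a minimal sketch, not the authors' code: the function names, the toy ratings, and the choice of the "balanced" weighting heuristic (total / (number of classes × class count)) are assumptions made for illustration, and the dataset description does not say which imbalance-handling method, if any, was used.

```python
from collections import Counter


def free_marginal_kappa(ratings, num_categories):
    """Randolph's free-marginal multirater kappa.

    ratings: list of per-item label lists, one label per annotator.
    Every item must carry the same number of ratings (at least 2).
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = []
    for item in ratings:
        counts = Counter(item)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_observed = sum(per_item) / len(per_item)

    # Chance agreement is fixed at 1/k because annotators are free to
    # use the categories in any proportion.
    p_chance = 1.0 / num_categories
    return (p_observed - p_chance) / (1.0 - p_chance)


def balanced_class_weights(class_counts):
    """Inverse-frequency weights: total / (n_classes * count) per class."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count)
            for label, count in class_counts.items()}


# Hypothetical ratings from 3 annotators on 4 items (not real KHS-2026 data).
TOY_RATINGS = [
    ["Neutral", "Neutral", "Neutral"],
    ["Neutral", "Neutral", "Offensive"],
    ["Hate Speech", "Hate Speech", "Hate Speech"],
    ["Offensive", "Hate Speech", "Neutral"],
]

# Approximate class counts as quoted in the dataset description.
KHS_COUNTS = {"Neutral": 4285, "Offensive": 1565, "Hate Speech": 601}

if __name__ == "__main__":
    print("toy kappa:", round(free_marginal_kappa(TOY_RATINGS, 3), 3))
    for label, weight in balanced_class_weights(KHS_COUNTS).items():
        print(f"{label}: {weight:.3f}")
```

With weights of this kind, errors on the Hate Speech class count several times more than errors on the Neutral class during training, which is one common way to trade some majority-class accuracy for better minority-class recall.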
Created:
2026-02-10