Kiswahili Hate speech dataset
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/rwhcyz3ndn
Description:
Kiswahili Hate Speech Dataset (KHS-2026)
Overview
KHS-2026 is a monolingual Kiswahili dataset created for hate speech detection in low-resource African languages. It contains 6,451 annotated social media posts/comments collected from public sources in East Africa, primarily Kenya.
Size & Classes
Total: 6,451 items
Neutral: ~65% (≈4,285)
Offensive: ~25% (≈1,565)
Hate Speech: ~10% (≈601)
The severe imbalance reflects real-world social media, where most content is neutral or mildly critical, while explicit hate is rare but harmful.
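To illustrate the imbalance, the approximate counts above can be turned into class shares and inverse-frequency class weights. This is a minimal sketch: the weighting formula, total / (n_classes * count), is a common heuristic and not something the dataset description prescribes, and the names CLASS_COUNTS, class_shares and balanced_weights are hypothetical.

```python
# Approximate class counts as listed in the dataset description.
CLASS_COUNTS = {"Neutral": 4285, "Offensive": 1565, "Hate Speech": 601}


def class_shares(counts):
    """Return each class's fraction of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count).

    Assumption: this heuristic is one common way to counter imbalance;
    the dataset description does not prescribe a weighting scheme.
    """
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


if __name__ == "__main__":
    for label, share in class_shares(CLASS_COUNTS).items():
        print(f"{label}: {share:.1%}")
    for label, weight in balanced_weights(CLASS_COUNTS).items():
        print(f"{label}: weight {weight:.2f}")
```

Weights of this kind give the rarest class (Hate Speech) the largest weight, which is one possible starting point for the imbalance handling the description calls for.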
Sources
Public posts/comments from:
Twitter (now X)
Facebook
YouTube
Focus: Sensitive topics including politics, protests, gender, ethnicity, and religion (mostly Kenyan discourse).
Collection & Preprocessing
Raw collection: ~6,539 items (Twitter API, Facebook CrowdTangle, YouTube API + manual)
Preprocessing: Conservative cleaning
Removed duplicates
Normalized whitespace & UTF-8
Filtered broken URLs, excess punctuation, irrelevant metadata
Kept emojis, hashtags, mentions for context
Minimal spelling fixes; no stemming/lemmatization
Final size after cleaning: 6,451 items
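The conservative cleaning steps above might look roughly like the following. This is a sketch under stated assumptions, not the authors' pipeline: NFC is assumed as the Unicode normalization form, "excess punctuation" is assumed to mean runs of four or more identical marks, and the function names and sample strings are invented for illustration. URL and metadata filtering are left out.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")
# Assumption: "excess punctuation" means runs of 4+ identical marks; trim to 3.
_PUNCT_RUN = re.compile(r"([!?.,])\1{3,}")


def clean_text(text):
    """Conservative cleaning: NFC-normalize, trim punctuation runs, collapse whitespace.

    Emojis, hashtags (#...) and mentions (@...) are left untouched.
    No stemming, lemmatization or spelling correction is applied.
    """
    text = unicodedata.normalize("NFC", text)
    text = _PUNCT_RUN.sub(r"\1\1\1", text)
    return _WHITESPACE.sub(" ", text).strip()


def deduplicate(texts):
    """Drop empty items and exact duplicates after cleaning, keeping first occurrences."""
    seen = set()
    unique = []
    for raw in texts:
        cleaned = clean_text(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    # Invented sample strings, not items from the dataset.
    print(deduplicate(["Asante  sana", "Asante sana", "Karibu @rafiki #amani"]))
```

Deduplicating after cleaning, rather than before, treats items that differ only in spacing as the same post.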
Annotation
Annotators: 3 native/fluent Kiswahili speakers (linguistics/comms background)
Training: 2-day session + pilot on 500 samples
Scheme: 3 primary labels – Neutral / Offensive / Hate Speech
Inter-annotator agreement: Randolph’s free-marginal multirater kappa = 0.72 (fair/good)
Disagreements: Resolved via discussion + majority vote
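For reference, Randolph's free-marginal multirater kappa and a simple majority vote can be computed as sketched below. The function names and toy ratings are hypothetical. Returning None on a three-way tie, so the item can be flagged for discussion, is an assumption about how the two resolution steps combine.

```python
from collections import Counter

LABELS = ("Neutral", "Offensive", "Hate Speech")


def free_marginal_kappa(ratings, n_categories):
    """Randolph's free-marginal multirater kappa.

    `ratings` is a list of per-item label lists, one label per annotator;
    every item must have the same number of annotators (at least 2).
    """
    n_raters = len(ratings[0])
    pairs = n_raters * (n_raters - 1)
    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    p_obs = sum(
        sum(c * (c - 1) for c in Counter(item).values()) / pairs
        for item in ratings
    ) / len(ratings)
    p_exp = 1.0 / n_categories  # chance agreement when marginals are free
    return (p_obs - p_exp) / (1.0 - p_exp)


def majority_label(item):
    """Return the label chosen by most annotators, or None on a tie."""
    (top, top_n), *rest = Counter(item).most_common()
    if rest and rest[0][1] == top_n:
        return None  # assumption: ties are flagged for discussion
    return top


if __name__ == "__main__":
    # Invented toy ratings from three annotators, not dataset annotations.
    toy = [
        ["Neutral", "Neutral", "Neutral"],
        ["Neutral", "Neutral", "Offensive"],
        ["Hate Speech", "Offensive", "Neutral"],
    ]
    print(free_marginal_kappa(toy, len(LABELS)))
    print([majority_label(item) for item in toy])
```

The free-marginal variant fixes chance agreement at 1/k (here 1/3), so unlike fixed-marginal kappas it is not pulled down by a skewed label distribution.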
Validation
Baseline models (TF-IDF + 75/25 split):
Overall accuracy: ~69%
Logistic Regression: Neutral F1 ≈ 0.81; Offensive/Hate F1 ≈ 0.41 each
SVM: Strong precision but very low recall on the minority classes (F1 0.13–0.22)
Poor minority-class performance due to imbalance + subtle expressions (sarcasm, metaphors, cultural nuance).
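The gap between overall accuracy and minority-class F1 is easier to see with per-class scores. The sketch below computes them in plain Python. The toy labels are invented to mimic a classifier that leans on the majority class and are not KHS-2026 results, and the function names are hypothetical.

```python
def per_class_f1(gold, predicted):
    """Return {label: (precision, recall, f1)} for each label seen in `gold`."""
    scores = {}
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Invented toy labels (NOT KHS-2026 results): most minority-class items
# are predicted as Neutral, so accuracy stays high while recall collapses.
gold = ["Neutral"] * 7 + ["Offensive"] * 2 + ["Hate Speech"]
pred = ["Neutral"] * 7 + ["Neutral", "Offensive"] + ["Neutral"]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(gold, pred):.2f}")
    for label, (p, r, f1) in per_class_f1(gold, pred).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting per-class or macro-averaged F1 alongside accuracy keeps the majority class from masking weak performance on Offensive and Hate Speech items.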
Purpose & Release
Provides a culturally grounded resource for Kiswahili NLP, complements AfriHate, and supports content moderation in East/Central Africa.
Planned release: Public under CC BY-NC 4.0 with datasheet (following Gebru et al., 2018) detailing sources, biases, limitations (dialect variation, code-mixing risks, subtle hostility).
Key insight
The dataset shows reliable neutral annotation but highlights major challenges in detecting nuanced hate/offensive content in African languages – a call for better imbalance handling and culturally informed models.
Created:
2026-02-10



