Code-mixed Chaos : Multi-labeled Banglish & Bangla Corpus for Toxicity analysis
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/23dp3t88vk
Description:
The dataset addresses a crucial gap in toxicity detection for Banglish, a code-mixed form of Bengali and English written in Roman script that is often overlooked in NLP research. To address this gap, we present a manually collected, multi-labeled dataset comprising 10,234 Banglish social media comments, annotated across 10 classes: nine toxic categories and one non-toxic category. The toxic categories are (1) Vulgar-based, (2) Religious-Hostility, (3) Troll-based, (4) Insult-based, (5) Loathe-based, (6) Threat-based, (7) Race-based, (8) Sexual-based, and (9) Political-Chaos; the single Non-toxic category covers comments that contain no form of toxicity. The dataset is evenly split between toxic (5,117) and non-toxic (5,117) entries. Samples were sourced from platforms such as Facebook, YouTube, Instagram, and X (formerly Twitter). To balance the dataset, non-toxic texts were selectively added from a publicly available corpus: "Bengali & Banglish: A Monolingual Dataset for Emotion Detection in Linguistically Diverse Contexts". Additionally, we provide a Bangla-translated version of the dataset to support script-based comparative analysis in toxicity detection.
Created:
2025-07-11
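
The label scheme in the description (nine toxic classes plus one non-toxic class) lends itself to a multi-hot encoding, where each comment is represented by a 10-element 0/1 vector. The following is a minimal sketch under stated assumptions: the class names are taken from the description, but the helper names (`to_multi_hot`, `is_toxic`), the in-memory record layout, and the placeholder comments are hypothetical and do not reflect the actual file format or column names of the dataset. It also assumes that "multi-labeled" means a single comment may carry more than one toxic class.

```python
# Class names come from the dataset description; everything else in this
# sketch (helper names, record layout, sample records) is hypothetical.
LABELS = [
    "Vulgar-based", "Religious-Hostility", "Troll-based", "Insult-based",
    "Loathe-based", "Threat-based", "Race-based", "Sexual-based",
    "Political-Chaos", "Non-toxic",
]
TOXIC_LABELS = LABELS[:-1]  # the nine toxic classes


def to_multi_hot(labels):
    """Encode a collection of label names as a 10-element 0/1 vector."""
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in LABELS]


def is_toxic(vector):
    """A comment counts as toxic if any of the nine toxic classes is set."""
    return any(vector[:len(TOXIC_LABELS)])


# Placeholder records, not real entries from the corpus.
samples = [
    ("<comment 1>", {"Insult-based", "Troll-based"}),
    ("<comment 2>", {"Non-toxic"}),
]

encoded = [(text, to_multi_hot(labels)) for text, labels in samples]
toxic_count = sum(is_toxic(vec) for _, vec in encoded)
print(f"{toxic_count} toxic / {len(encoded) - toxic_count} non-toxic")
```

Run over the full corpus, a count like this would be one way to check the stated 5,117 / 5,117 split, once the records are loaded using the dataset's real column names.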



