Code-mixed Chaos : Multi-labeled Banglish & Bangla Corpus for Toxicity analysis

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://data.mendeley.com/datasets/23dp3t88vk

Description:

This dataset addresses a crucial gap in toxicity detection for Banglish, a code-mixed form of Bengali and English written in Roman script that is often overlooked in NLP research. We present a manually collected, multi-labeled dataset of 10,234 Banglish social media comments annotated across 10 classes: nine toxic categories and one non-toxic category. The toxic categories are: (1) Vulgar-based, (2) Religious-Hostility, (3) Troll-based, (4) Insult-based, (5) Loathe-based, (6) Threat-based, (7) Race-based, (8) Sexual-based, and (9) Political-Chaos. The single Non-toxic category covers comments that contain no form of toxicity. The dataset is evenly split between toxic (5,117) and non-toxic (5,117) entries. Samples were sourced from platforms such as Facebook, YouTube, Instagram, and X (formerly Twitter). To balance the dataset, additional non-toxic texts were selectively added from a publicly available corpus, "Bengali & Banglish: A Monolingual Dataset for Emotion Detection in Linguistically Diverse Contexts". We also provide a Bangla-translated version of the dataset to support script-based comparative analysis in toxicity detection.
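
The ten-class labeling scheme described above can be represented as a multi-hot vector per comment. The following is a minimal sketch only: the class names come from the description, but the label order, the function name `encode_labels`, and the input format (a list of label strings) are assumptions made for illustration. The description does not state the dataset's actual file layout or column names.

```python
# Class names as listed in the dataset description.
# The ordering below is an arbitrary choice for this sketch, not the dataset's.
TOXIC_LABELS = [
    "Vulgar-based",
    "Religious-Hostility",
    "Troll-based",
    "Insult-based",
    "Loathe-based",
    "Threat-based",
    "Race-based",
    "Sexual-based",
    "Political-Chaos",
]
NON_TOXIC = "Non-toxic"
ALL_LABELS = TOXIC_LABELS + [NON_TOXIC]


def encode_labels(labels):
    """Return a 10-element multi-hot list for one comment's labels.

    Hypothetical helper: the real dataset may store labels differently.
    """
    chosen = set(labels)
    if not chosen:
        raise ValueError("a comment needs at least one label")
    unknown = chosen - set(ALL_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    # Non-toxic comments carry no form of toxicity, so the label is exclusive.
    if NON_TOXIC in chosen and len(chosen) > 1:
        raise ValueError("Non-toxic cannot be combined with toxic labels")
    return [1 if name in chosen else 0 for name in ALL_LABELS]


if __name__ == "__main__":
    print(encode_labels(["Insult-based", "Threat-based"]))
    print(encode_labels(["Non-toxic"]))
```

A multi-hot encoding lets one comment carry several toxic categories at once, while the exclusivity check keeps the Non-toxic class consistent with its definition.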

Created:

2025-07-11
