
Code-mixed Chaos : Multi-labeled Banglish & Bangla Corpus for Toxicity analysis

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/23dp3t88vk
Description:
This dataset addresses a gap in toxicity detection for Banglish, a code-mixed form of Bengali and English written in Roman script that is often overlooked in NLP research. We present a manually collected, multi-labeled dataset of 10,234 Banglish social media comments annotated across 10 classes. Nine classes cover toxic content: (1) Vulgar-based, (2) Religious-Hostility, (3) Troll-based, (4) Insult-based, (5) Loathe-based, (6) Threat-based, (7) Race-based, (8) Sexual-based, and (9) Political-Chaos. The tenth class, Non-toxic, marks comments that contain no form of toxicity. The dataset is evenly split between toxic (5,117) and non-toxic (5,117) entries. Samples were sourced from platforms such as Facebook, YouTube, Instagram, and X (formerly Twitter). To balance the dataset, non-toxic texts were selectively added from a publicly available corpus, "Bengali & Banglish: A Monolingual Dataset for Emotion Detection in Linguistically Diverse Contexts". We also provide a Bangla-translated version of the dataset to support script-based comparative analysis in toxicity detection.
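
As a rough illustration of the labeling scheme described above, the ten classes could be represented as a multi-hot vector per comment. This is only a sketch: the class names come from the description, but the class order, the helper names `encode_labels` and `is_toxic`, and the vector layout are assumptions for illustration and do not reflect the dataset's actual file schema.

```python
# Class names are taken from the description; their order and the
# multi-hot layout are illustrative assumptions, not the dataset's schema.
TOXIC_CLASSES = [
    "Vulgar-based", "Religious-Hostility", "Troll-based", "Insult-based",
    "Loathe-based", "Threat-based", "Race-based", "Sexual-based",
    "Political-Chaos",
]
NON_TOXIC = "Non-toxic"
ALL_CLASSES = TOXIC_CLASSES + [NON_TOXIC]


def encode_labels(labels):
    """Return a 10-element multi-hot vector for one comment's labels."""
    chosen = set(labels)
    unknown = chosen - set(ALL_CLASSES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if not chosen:
        raise ValueError("a comment needs at least one label")
    # Non-toxic means "no form of toxicity", so it cannot co-occur
    # with any of the nine toxic classes.
    if NON_TOXIC in chosen and len(chosen) > 1:
        raise ValueError("Non-toxic cannot be combined with toxic labels")
    return [1 if name in chosen else 0 for name in ALL_CLASSES]


def is_toxic(vector):
    """True if any of the nine toxic positions is set."""
    return any(vector[: len(TOXIC_CLASSES)])
```

A comment tagged with both Insult-based and Threat-based would set two positions in the vector, while a Non-toxic comment sets only the last one.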

Created:
2025-07-11