Bengali Toxic Comments Dataset
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/deepu099cse/Multi-Labeled-Bengali-Toxic-Comments-Classification
Overview:
This dataset contains 16,073 Bengali comments, each manually annotated as toxic or non-toxic. Toxic comments are further labeled with one or more of six subcategories: vulgar, hate, religious, threat, troll, and insult. Annotation was carried out manually by professional annotators, and the reported average kappa score for the toxic and non-toxic classes is 0.96. The intended task is multi-label classification of toxic comments.
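As a rough illustration of the multi-label setup described above, the sketch below encodes one comment's annotations as a binary toxic flag plus a six-element multi-hot vector. The subcategory identifiers, their order, and the `encode_labels` helper are assumptions made for this example. They are not taken from the dataset files, whose actual column names and layout may differ.

```python
# Hypothetical subcategory identifiers and ordering (an assumption for this
# sketch; the real dataset files may use different names or column order).
SUBCATEGORIES = ["vulgar", "hate", "religious", "threat", "troll", "insult"]


def encode_labels(tags):
    """Return (is_toxic, multi_hot) for a set of subcategory tags.

    A comment with no subcategory tags is treated as non-toxic.
    """
    unknown = set(tags) - set(SUBCATEGORIES)
    if unknown:
        raise ValueError(f"unknown subcategories: {sorted(unknown)}")
    multi_hot = [1 if name in tags else 0 for name in SUBCATEGORIES]
    return int(any(multi_hot)), multi_hot


# Made-up annotations, not drawn from the dataset.
print(encode_labels({"vulgar", "insult"}))  # (1, [1, 0, 0, 0, 0, 1])
print(encode_labels(set()))                 # (0, [0, 0, 0, 0, 0, 0])
```

Because a single comment can carry several subcategories at once, a multi-hot vector is used here rather than a single class index.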
Provided by:
Authors of the paper



