naimul011/BanglaToxicCommentsDB

Name: naimul011/BanglaToxicCommentsDB
Creator: naimul011
Published: 2023-07-17 12:55:58
License: No description available

Source: Hugging Face. Updated 2023-07-17; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/naimul011/BanglaToxicCommentsDB
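For readers who prefer to pull the data programmatically, the following is a minimal sketch that assumes the third-party `datasets` library is installed and that the repository can be read by its generic loader. The function name `load_bangla_toxic_comments` is hypothetical, and the card does not document splits or column names, so inspect the returned object before relying on any field.

```python
# Repository id taken from the listing above.
REPO_ID = "naimul011/BanglaToxicCommentsDB"


def load_bangla_toxic_comments():
    """Download the dataset from the Hugging Face Hub (needs network access).

    Assumption: the repository is loadable with the generic loader of the
    `datasets` library. Splits and columns are not documented in the card.
    """
    # Imported lazily so this sketch can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID)


if __name__ == "__main__":
    ds = load_bangla_toxic_comments()
    print(ds)  # shows whatever splits and columns the repository actually provides
```

Printing the returned object is the quickest way to discover the real splits and fields, since the card leaves those sections empty.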

Resource description:

---
language:
- bn
tags:
- toxic comments
size_categories:
- 10K<n<100K
---

# Dataset Card for Dataset Name

## Dataset Description

- **Homepage:**
- **Repository:** [Toxic-Comment-Detection-BN](https://github.com/imbodrulalam/Toxic-Comment-Detection-BN)
- **Paper:** [Bangla Toxic Comment Classification and Severity Measure Using Deep Learning](https://www.researchgate.net/publication/368895245_Bangla_Toxic_Comment_Classification_and_Severity_Measure_Using_Deep_Learning)
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Deep learning models need a large amount of training data, so collecting enough data was a major challenge for us. Some sample comments that we collected are given below:

ছাগেলর বাƐা ছাগল েদেখ পুড়াই িহজড়ার মেতা েদখেত পাডার েপা পাডা েতাের ময্ানেহােল ডু বাইয়া মারেত পারতাম যিদ

We collected almost 4,141 labeled comments from the previous work on Bangla toxic comments by Jubaer et al. [6], which are described in Table 1. To obtain more data, we collected a total of 22,000 comments from TikTok, the majority of which are toxic. Our experts labeled these comments using 6 categories that are not mutually exclusive. All annotators were given clear guidelines on how to rate the comments; the guidelines are summarized in Table I.

![Alt text](Capture.PNG)

The annotated comments were cleaned by removing emoticons, unnecessary punctuation marks, characters, digits, and other symbols, as they contribute very little to the context of the comments.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]

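Provided by:

naimul011

Summary of original information

Dataset overview

Dataset name

Name: Toxic-Comment-Detection-BN

Dataset description

Language: Bengali (bn)
Tags: toxic comments
Size category: 10,000 < n < 100,000

Dataset source

Paper: Bangla Toxic Comment Classification and Severity Measure Using Deep Learning
Data collection: A total of 22,000 comments were collected from TikTok, most of which are toxic.

Dataset contents

Data volume: Almost 4,141 labeled comments taken from earlier work, plus 22,000 comments collected from TikTok.
Annotation: Experts labeled the comments using 6 categories that are not mutually exclusive, following clear annotation guidelines.
Cleaning: Emoticons, unnecessary punctuation marks, characters, digits, and other symbols were removed.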
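
The cleaning step described above could be approximated as follows. This is only a sketch under stated assumptions, not the authors' actual pipeline: the function name `clean_comment` is hypothetical, and the rule used here (drop every character whose Unicode general category is punctuation, symbol, or number) is one plausible reading of "emoticons, punctuation marks, digits, and other symbols".

```python
import unicodedata

# Unicode general-category prefixes to drop (an assumption, not the authors' rule):
# P = punctuation, S = symbols (covers most emoji), N = numbers (ASCII and Bengali digits).
DROP_PREFIXES = ("P", "S", "N")
VARIATION_SELECTOR_16 = "\ufe0f"  # emoji presentation selector, category Mn


def clean_comment(text: str) -> str:
    """Remove punctuation, symbols, and digits, then collapse whitespace."""
    kept = []
    for ch in text:
        if ch == VARIATION_SELECTOR_16:
            continue  # drop the selector that often trails an emoji
        if unicodedata.category(ch)[0] in DROP_PREFIXES:
            kept.append(" ")  # replace with a space so neighbouring words stay separate
        else:
            kept.append(ch)  # letters and combining vowel signs are kept
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(clean_comment("তুমি ভালো!!! 😂 123"))
```

Filtering by Unicode category rather than by a Bengali-only code range keeps Bengali vowel signs (which are combining marks) intact and also leaves any Latin-script words in place. Zero-width joiners are kept because Bangla orthography uses them, so one may remain where an emoji sequence was removed.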

Dataset usage

Purpose: Training deep learning models, in particular for Bengali toxic comment classification and severity measurement.
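
Because the 6 categories are not mutually exclusive, a classifier trained on this data would normally use a multi-hot target vector rather than a single class index. The sketch below illustrates that encoding. The category names are hypothetical placeholders, since the listing gives only the number of categories and not their names, and the helper functions are illustrative rather than part of the dataset.

```python
from typing import Iterable, List

# Hypothetical placeholder names: the source states there are 6 categories
# but does not list them in text, so real names must be substituted.
CATEGORIES = [
    "category_1", "category_2", "category_3",
    "category_4", "category_5", "category_6",
]


def encode_labels(active: Iterable[str]) -> List[int]:
    """Turn the set of categories assigned to one comment into a 0/1 vector."""
    active_set = set(active)
    unknown = active_set - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if name in active_set else 0 for name in CATEGORIES]


def decode_labels(vector: List[int]) -> List[str]:
    """Recover the category names from a 0/1 vector."""
    return [name for name, flag in zip(CATEGORIES, vector) if flag]


if __name__ == "__main__":
    vec = encode_labels(["category_2", "category_5"])
    print(vec, decode_labels(vec))
```

A comment with no category assigned encodes to an all-zero vector, which is how a non-toxic comment would be represented under this scheme.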

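Dataset structure

Data instances, fields, and splits: information missing

Dataset creation

Data collection and normalization, annotation process, annotator identity, personal and sensitive information: information missing

Considerations for use

Social impact, discussion of biases, other known limitations: information missing

Additional information

Dataset curators, licensing information, citation information, contributions: information missing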
