naimul011/BanglaToxicCommentsDB
Source: Hugging Face. Updated 2023-07-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/naimul011/BanglaToxicCommentsDB
Resource description:
---
language:
- bn
tags:
- toxic comments
size_categories:
- 10K<n<100K
---
# Dataset Card for BanglaToxicCommentsDB
## Dataset Description
- **Homepage:**
- **Repository:** [Toxic-Comment-Detection-BN](https://github.com/imbodrulalam/Toxic-Comment-Detection-BN)
- **Paper:** [Bangla Toxic Comment Classification and Severity Measure Using Deep Learning](https://www.researchgate.net/publication/368895245_Bangla_Toxic_Comment_Classification_and_Severity_Measure_Using_Deep_Learning)
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
Deep learning models need a large amount of training data,
so collecting enough comments to train our model was a
major challenge for us.
Some sample comments that we collected are given
below:
ছাগলের বাচ্চা ছাগল
দেখে পুড়াই হিজড়ার মতো দেখতে
পাডার পো পাডা তোরে ম্যানহোলে ডুবাইয়া মারতে পারতাম যদি
We collected 4,141 labeled comments from the previous
work on Bangla toxic comments by Jubaer et al. [6],
which are described in Table 1. To obtain more data, we
collected a total of 22,000 comments from TikTok,
the majority of which are toxic.
Our experts labeled these comments using 6 categories
that are not mutually exclusive. All annotators were
given clear guidelines on how to rate the comments.
The guidelines are summarized in Table I.
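
Because the 6 categories are not mutually exclusive, each comment can carry several labels at once. One common way to represent this is a multi-hot vector with one position per category. The sketch below illustrates the idea only: the category names are placeholders (the card does not list them), and `encode_labels` is a hypothetical helper, not part of the dataset or its repository.

```python
from typing import Iterable, List

# Placeholder names: the dataset card does not list the six categories,
# so these identifiers are assumptions used purely for illustration.
CATEGORIES = [
    "category_1",
    "category_2",
    "category_3",
    "category_4",
    "category_5",
    "category_6",
]


def encode_labels(assigned: Iterable[str]) -> List[int]:
    """Turn the categories assigned to one comment into a multi-hot vector."""
    assigned_set = set(assigned)
    unknown = assigned_set - set(CATEGORIES)
    if unknown:
        raise ValueError(f"Unknown categories: {sorted(unknown)}")
    # One slot per category; several slots may be 1 for the same comment.
    return [1 if name in assigned_set else 0 for name in CATEGORIES]


# A comment tagged with two categories at once.
vector = encode_labels(["category_2", "category_5"])
print(vector)  # [0, 1, 0, 0, 1, 0]
```

A comment that matches none of the categories encodes to an all-zero vector, which keeps non-toxic comments in the same representation.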

The annotated comments are cleaned by removing
emoticons, unnecessary punctuation marks, characters,
digits, and other symbols as they contribute very little
to the context of the comments.
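
The cleaning step described above can be approximated with a single regular expression. This is a minimal sketch, not the authors' actual script: it assumes that only Bengali-script letters and vowel signs (U+0980 to U+09E3) are kept, so emoticons, punctuation, ASCII and Bengali digits, Latin characters, and other symbols are all dropped. `clean_comment` is a hypothetical name.

```python
import re

# Assumption: keep only Bengali letters/signs (U+0980-U+09E3) and whitespace.
# Bengali digits (U+09E6-U+09EF) fall outside this range and are removed,
# as are emoji, punctuation, Latin characters, and other symbols.
_NOT_BENGALI_LETTER = re.compile(r"[^\u0980-\u09E3\s]")


def clean_comment(text: str) -> str:
    """Strip non-Bengali characters and collapse repeated whitespace."""
    without_symbols = _NOT_BENGALI_LETTER.sub(" ", text)
    # split() with no arguments also trims leading and trailing whitespace.
    return " ".join(without_symbols.split())


if __name__ == "__main__":
    print(clean_comment("ভালো 123 ১২ !!! 😀"))
```

Replacing removed characters with a space, instead of deleting them, avoids gluing together two words that were separated only by punctuation.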
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
Bengali (`bn`), as declared in the card metadata.
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
[More Information Needed]
Provider:
naimul011
Original information summary
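Dataset overview
Dataset name
- Name: Toxic-Comment-Detection-BN
Dataset description
- Language: Bengali (bn)
- Tags: toxic comments
- Size category: 10,000 < n < 100,000
Dataset source
- Paper: Bangla Toxic Comment Classification and Severity Measure Using Deep Learning
- Data collection: A total of 22,000 comments were collected from TikTok, most of which are toxic.
Dataset contents
- Data volume: 4,141 labeled comments were collected, along with 22,000 comments from TikTok.
- Annotation: Experts labeled the comments using 6 categories that are not mutually exclusive, following clear annotation guidelines.
- Data cleaning: Emoticons, unnecessary punctuation marks, characters, digits, and other symbols were removed.
Dataset usage
- Purpose: Training deep learning models, in particular for Bangla toxic comment classification and severity measurement.
Dataset structure
- Data instances, fields, and splits: information missing
Dataset creation
- Data collection and normalization, annotation process, annotator identity, personal and sensitive information: information missing
Usage considerations
- Social impact, discussion of biases, other known limitations: information missing
Additional information
- Dataset curators, licensing information, citation information, contributions: information missing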



