Bangla NLP Dataset for Sentiment Analysis, Topic Classification, and Hate Speech Detection
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/pv3mr44472
Resource description:
Introduction:
The Bangla NLP Dataset for Sentiment Analysis, Topic Classification, and Hate Speech Detection is a manually curated Bangla text dataset designed to support research on low-resource Natural Language Processing. The data has been collected from several well-known Bangla newspapers, including Prothom Alo, Jugantor, Kaler Kantho, and Bangladesh Pratidin, ensuring linguistic diversity and content reliability. Consistent preprocessing and labeling guidelines were applied to facilitate reproducible experimentation across multiple Bangla NLP classification tasks under both high-resource and low-resource learning settings.
Dataset Overview:
This dataset consists of three task-specific Bangla NLP datasets, each constructed with balanced class distributions:
Sentiment Analysis Dataset
Classes: Positive, Negative, Neutral
Samples per class: 1000
Total samples: 3000
Topic Classification Dataset
Classes: Bangladesh, International, Sports, Entertainment
Samples per class: 1000
Total samples: 4000
Hate Speech Detection Dataset
Classes: Hate, Non-Hate
Samples per class: 1000
Total samples: 2000
All datasets are sentence-level, manually labeled, and preprocessed using a unified pipeline to ensure consistency across tasks and fair comparative evaluation.
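As a quick sanity check on the figures above, the following sketch encodes the stated class lists and per-class counts, then verifies that a list of labels is balanced as described. The `EXPECTED` table and the `expected_total` and `check_balance` helpers are illustrative names, not part of the dataset release. How the labels are loaded depends on the file format, which is not described here.

```python
from collections import Counter

# Class lists and per-class sample counts as stated in the overview above.
# The task keys ("sentiment", "topic", "hate_speech") are hypothetical names.
EXPECTED = {
    "sentiment": (["Positive", "Negative", "Neutral"], 1000),
    "topic": (["Bangladesh", "International", "Sports", "Entertainment"], 1000),
    "hate_speech": (["Hate", "Non-Hate"], 1000),
}


def expected_total(task):
    """Total number of samples implied by the stated class list and per-class count."""
    classes, per_class = EXPECTED[task]
    return len(classes) * per_class


def check_balance(task, labels):
    """Return True if `labels` contains exactly the expected classes and counts."""
    classes, per_class = EXPECTED[task]
    counts = Counter(labels)
    return set(counts) == set(classes) and all(
        counts[c] == per_class for c in classes
    )
```

Calling `expected_total` for each task gives the totals listed above (3000, 4000, and 2000), so the three task datasets together contain 9000 labeled sentences.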
Applications and Motivation:
This dataset supports three Bangla NLP applications: sentiment analysis, topic classification, and hate speech detection. The primary motivation for collecting it is to enable few-shot learning research for Bangla, where large-scale labeled data is often unavailable. Because every class has the same number of samples and the three tasks differ in label set, the dataset is well suited to evaluating data-efficient learning methods such as metric-based approaches. It can also be used to benchmark supervised, few-shot, and low-resource NLP models for the Bangla language.
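One common way to run few-shot experiments on a balanced dataset like this is episodic sampling, where each episode draws N classes with K labeled support examples and a few query examples per class. The sketch below is a minimal illustration under stated assumptions, not the dataset authors' protocol: it assumes the data is already loaded as a list of `(text, label)` pairs, and the function name `sample_episode` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def sample_episode(examples, n_way, k_shot, n_query, seed=None):
    """Sample one N-way K-shot episode from a list of (text, label) pairs.

    Returns (support, query), each a list of (text, label) pairs.
    Support and query examples are drawn without overlap.
    """
    rng = random.Random(seed)

    # Group sentences by class label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    if n_way > len(by_label):
        raise ValueError("n_way exceeds the number of available classes")

    # Pick the classes for this episode (sorted first for reproducibility).
    classes = rng.sample(sorted(by_label), n_way)

    support, query = [], []
    for label in classes:
        texts = by_label[label]
        if len(texts) < k_shot + n_query:
            raise ValueError(f"not enough examples for class {label!r}")
        picked = rng.sample(texts, k_shot + n_query)
        support.extend((t, label) for t in picked[:k_shot])
        query.extend((t, label) for t in picked[k_shot:])
    return support, query
```

With 1000 samples per class, a 3-way 5-shot episode on the sentiment task uses only 15 labeled support sentences, leaving most of the data available for repeated episodes or held-out evaluation.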
Created:
2026-01-05



