Bangla NLP Dataset for Sentiment Analysis, Topic Classification, and Hate Speech Detection

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://data.mendeley.com/datasets/pv3mr44472


Description:

Introduction: The Bangla NLP Dataset for Sentiment Analysis, Topic Classification, and Hate Speech Detection is a manually curated Bangla text dataset designed to support research on low-resource Natural Language Processing. The data were collected from several well-known Bangla newspapers, including Prothom Alo, Jugantor, Kaler Kantho, and Bangladesh Pratidin, ensuring linguistic diversity and content reliability. Consistent preprocessing and labeling guidelines were applied to facilitate reproducible experimentation across multiple Bangla NLP classification tasks under both high-resource and low-resource learning settings.

Dataset Overview: This dataset consists of three task-specific Bangla NLP datasets, each constructed with balanced class distributions:

Sentiment Analysis Dataset
Classes: Positive, Negative, Neutral
Samples per class: 1000
Total samples: 3000

Topic Classification Dataset
Classes: Bangladesh, International, Sports, Entertainment
Samples per class: 1000
Total samples: 4000

Hate Speech Detection Dataset
Classes: Hate, Non-Hate
Samples per class: 1000
Total samples: 2000

All datasets are sentence-level, manually labeled, and preprocessed using a unified pipeline to ensure consistency across tasks and fair comparative evaluation.

Applications and Motivation: This dataset supports a wide range of Bangla NLP applications, including sentiment analysis, topic classification, and hate speech detection. The primary motivation behind collecting this dataset is to enable few-shot learning research for Bangla, where large-scale labeled data is often unavailable. The balanced and task-diverse structure of the dataset makes it particularly suitable for evaluating data-efficient learning methods, such as few-shot learning and metric-based approaches. It can also be used for benchmarking supervised, few-shot, and low-resource NLP models for the Bangla language.
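Example: The few-shot use case described above is commonly evaluated with N-way K-shot episodes, where a small support set and a disjoint query set are drawn per class. The following is a minimal sketch of such a sampler, not the dataset authors' procedure. The function name sample_episode, its parameters, and the placeholder sentences are all hypothetical; the listing does not specify the file format, so the example works on an in-memory list of (text, label) pairs rather than loading the real files.

```python
import random
from collections import defaultdict


def sample_episode(examples, n_way, k_shot, n_query, seed=None):
    """Draw one N-way K-shot episode from (text, label) pairs.

    Hypothetical helper for illustration; returns (support, query),
    two lists of (text, label) pairs that share no items.
    """
    rng = random.Random(seed)

    # Group sentences by their class label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    if n_way > len(by_label):
        raise ValueError("n_way exceeds the number of available classes")

    # Choose which classes take part in this episode.
    classes = rng.sample(sorted(by_label), n_way)

    support, query = [], []
    for label in classes:
        pool = by_label[label]
        if len(pool) < k_shot + n_query:
            raise ValueError(f"not enough samples for class {label!r}")
        # Sample without replacement so support and query stay disjoint.
        picked = rng.sample(pool, k_shot + n_query)
        support += [(text, label) for text in picked[:k_shot]]
        query += [(text, label) for text in picked[k_shot:]]
    return support, query


# Placeholder data standing in for the sentiment subset (assumption:
# the real data would be loaded into the same (text, label) shape).
labels = ["Positive", "Negative", "Neutral"]
toy = [(f"sentence {i} ({lab})", lab) for lab in labels for i in range(20)]

support, query = sample_episode(toy, n_way=3, k_shot=5, n_query=10, seed=0)
```

Because every class in each subset has the same number of samples (1000), the same k_shot and n_query values can be applied to all classes without reweighting.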

Created:

2026-01-05
