Bengali Spam Comment Dataset for Social Media Content Moderation
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/s2gxmnjrt8
Description:
This dataset contains 9,000 Bengali comments, balanced between spam and non-spam, labeled for spam classification on social media platforms. Each entry includes the original comment, its spam classification label, and a processed version suitable for machine learning model training. The dataset is structured into three columns: COMMENTS, CLASS, and COMMENTS_PROCESSED, and is intended as a resource for developing Bengali spam detection systems.
Dataset Structure
COMMENTS – A text field containing original, unmodified comments written in Bengali (Bangla). These comments are collected from popular social media platforms and include expressions of engagement, promotional content, personal communication, generic engagement-boosting messages, bot-generated posts, and scam-related content typical of social media interactions. This column preserves the raw data as originally collected from social media sources.
CLASS – A categorical variable indicating whether the comment is spam or non-spam. Binary classification with the following labels:
1 (Spam): Comments containing promotional content with money-making schemes, suspicious links/URLs, generic engagement-boosting comments, bot-generated repetitive messages, or scam-related content promising unrealistic financial returns
0 (Non-Spam): Comments that are contextually relevant, show genuine user engagement, and contribute meaningfully to discussions
COMMENTS_PROCESSED – A preprocessed version of the original comments optimized for machine learning and NLP model training. This column contains the cleaned, normalized text after applying a comprehensive Bengali-specific preprocessing pipeline while preserving semantic meaning.
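The exact preprocessing steps are not listed in this description. As a rough illustration only, the sketch below shows one way a Bengali-aware cleaning step could look. The function name `preprocess_comment` and every rule in it (NFC normalization, URL removal, punctuation and emoji stripping) are assumptions, not the dataset authors' pipeline.

```python
import re
import unicodedata

# Assumed cleaning rules for illustration; the dataset's actual pipeline is not documented here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep the Bengali Unicode block (U+0980-U+09FF), ASCII alphanumerics, whitespace,
# and the zero-width (non-)joiners that Bengali conjuncts can rely on.
DROP_RE = re.compile(r"[^\u0980-\u09FF0-9A-Za-z\s\u200c\u200d]")


def preprocess_comment(text: str) -> str:
    """Normalize one comment: NFC, strip URLs, drop punctuation/emoji, collapse spaces."""
    text = unicodedata.normalize("NFC", text)  # unify equivalent code point sequences
    text = URL_RE.sub(" ", text)               # remove links
    text = DROP_RE.sub(" ", text)              # remove everything outside the kept ranges
    return " ".join(text.split())              # collapse runs of whitespace
```

Dropping URLs is a design choice: for spam detection a link can itself be a useful signal, so a real pipeline might replace it with a placeholder token instead of deleting it.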
Key Features
1. Total Records: 9,000 balanced comments
2. Language: Bengali (Bangla)
3. Class Distribution: 4,500 spam comments (50%) and 4,500 non-spam comments (50%)
4. Classification Type: Binary spam classification
5. Data Collection Period: Recent social media activity
6. Average Comment Length: 17 words (std. dev. ± 15 words)
- Spam comments: Average 20 words
- Non-spam comments: Average 18 words
- Comment Length Range: Minimum 1 word, Maximum 52 words
- Missing Values: None (all 9,000 records contain values in all three columns)
Data Types:
- COMMENTS (string)
- CLASS (integer: 0 or 1)
- COMMENTS_PROCESSED (string)
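The stated structure (three columns, binary labels, no missing values, balanced classes) can be checked after download. The sketch below uses only the Python standard library and assumes a single CSV file containing all three columns. The function name `summarize` and the file name in the usage comment are hypothetical.

```python
import csv
from collections import Counter

EXPECTED_COLUMNS = ["COMMENTS", "CLASS", "COMMENTS_PROCESSED"]


def summarize(lines):
    """Summarize CSV rows given as an open text file or any iterable of lines."""
    reader = csv.DictReader(lines)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    class_counts = Counter()
    word_counts = []
    empty_cells = 0
    for row in reader:
        # Count cells that are absent or contain only whitespace.
        empty_cells += sum(1 for c in EXPECTED_COLUMNS if not (row[c] or "").strip())
        label = (row["CLASS"] or "").strip()
        if label:
            class_counts[int(label)] += 1
        # Whitespace tokenization as a simple word count.
        word_counts.append(len((row["COMMENTS"] or "").split()))

    return {
        "records": len(word_counts),
        "class_counts": dict(class_counts),
        "empty_cells": empty_cells,
        "mean_words": sum(word_counts) / len(word_counts) if word_counts else 0.0,
        "min_words": min(word_counts, default=0),
        "max_words": max(word_counts, default=0),
    }


# Usage (hypothetical file name):
# with open("bengali_spam.csv", encoding="utf-8", newline="") as f:
#     print(summarize(f))
```

If the file matches the description above, `records` should be 9,000, `class_counts` should show 4,500 per label, and `empty_cells` should be 0.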
Data Collection Sources
Comments were systematically collected from:
1. Platforms: Facebook pages and YouTube channels
2. Content Domains: News, entertainment, sports, technology, and lifestyle
3. Notable Sources: Jamuna TV, Ekattor TV, Bongo BD, 10 Minute School, and other high-traffic Bengali content channels
4. Collection Method: Official Facebook Graph API and YouTube API for publicly accessible comments
5. Privacy Protection: All personal details (usernames, phone numbers, email addresses) have been scrubbed in both raw and processed versions
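How the personal details were scrubbed is not described. Purely as an illustration, a regex-based scrubber might look like the sketch below. The patterns, the placeholder tokens (`<EMAIL>`, `<PHONE>`, `<USER>`), and the function name `scrub_pii` are all assumptions. The phone pattern targets Bangladeshi mobile numbers written in ASCII or Bengali digits, and only `@`-style mentions are treated as usernames.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Optional +88/88 prefix, then 01, an operator digit 3-9, and eight more digits.
# \u09e6-\u09ef are the Bengali digits 0-9; \d also matches them in Python str patterns.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?[8\u09ee]{2})?[0\u09e6][1\u09e7][3-9\u09e9-\u09ef]\d{8}(?!\d)"
)
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def scrub_pii(text: str) -> str:
    """Replace emails, phone numbers, and @mentions with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)   # emails first, so their '@' is not seen as a mention
    text = PHONE_RE.sub("<PHONE>", text)
    text = MENTION_RE.sub("<USER>", text)
    return text
```

Pattern-based scrubbing like this will miss names written in running text, so it is at best a first pass rather than a complete anonymization step.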
Two CSV files are provided: one with the original comments and one with the processed comments.
Created:
2026-02-11



