
BTTC - A Bangla Tri-class Text Corpus for Spam, Ham, and Promotional Messages

Mendeley Data, indexed 2026-04-18
Download link:
https://data.mendeley.com/datasets/5wrm959d6f
Description:
---------------------------------------------------------------------
1. DATASET DESCRIPTION
---------------------------------------------------------------------
BTTC (Bangla Tri-class Text Corpus) contains 10,283 unique labeled Bangla messages collected from various SMS users and public Telegram channels in Bangladesh. Unlike traditional binary datasets (Spam vs. Ham) in other languages, BTTC introduces a third category, "Promotional" (PROMO), to distinguish legitimate marketing messages from malicious spam messages and normal ham messages.

This dataset is designed to facilitate research in Bangla-language spam detection, phishing identification, and linguistic analysis of promotional messages. It captures the linguistic shift from traditional SMS to modern messaging platforms like Telegram.

---------------------------------------------------------------------
2. CLASS DISTRIBUTION (Total: 10,283)
---------------------------------------------------------------------
1. HAM (3,904 messages):
   - Legitimate personal conversations.
   - Transactional notifications (Bank, Mobile Financial Services).
   - Government alerts and public service announcements.
2. PROMO (3,695 messages):
   - Marketing and promotional offers from telecom operators (GP, Robi, Airtel, Banglalink, Teletalk).
   - Focused on data packs, voice minutes, and bundle offers.
3. SPAM (2,684 messages):
   - Fraudulent messages and phishing attempts.
   - Online gambling and betting promotions.
   - Financial scams and fake prize offers.

---------------------------------------------------------------------
3. FILE STRUCTURE & COLUMNS
---------------------------------------------------------------------
File Name: BTTC.csv

Columns:
A. Text: The raw messages, preserved in the same state as when they were collected (masked for privacy).
B. Text_Clean: Cleaned version of Text, produced in Excel by removing line breaks (CHAR(10) and CHAR(13)), non-printable characters, and extra spaces.
C. Label: The classification tag (HAM, PROMO, SPAM).
D. Source: Platform origin (SMS or Telegram).
E. Annotation_Process:
   - MANUAL: These messages were manually annotated by 3 annotators.
   - ANNOTATOR-REVIEWED: These messages were pseudo-labeled using manually labeled data, then reviewed and fixed by 3 annotators.

*PRIVACY NOTE:* All Personally Identifiable Information (PII) has been masked using a regex algorithm.
   - Phone Numbers: Last 5 digits are replaced with 'XXXXX'.
   - URLs: Domain kept (e.g., t.me), path masked (e.g., https://t.me/XXXXX).
   - Emails: Local part (before the @) is masked with 'XXXXX'.
   - Transaction IDs: Replaced with 'XXXXX'.
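
The Text_Clean column is described as the result of an Excel cleanup (line breaks, non-printable characters, extra spaces). The sketch below is a rough Python approximation of that description, not the authors' actual procedure. The function name clean_text is hypothetical, and two choices are assumptions: line breaks are replaced with a space rather than deleted, and "non-printable" is taken to mean Unicode control characters (category Cc).

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Approximate the Text -> Text_Clean step described for BTTC.csv."""
    # Assumption: CHAR(13) and CHAR(10) become spaces so adjacent words do not fuse.
    text = text.replace("\r", " ").replace("\n", " ")
    # Assumption: "non-printable" means control characters (category Cc).
    # Format characters such as the zero-width joiner (category Cf) are kept,
    # because Bangla conjuncts can depend on them.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" {2,}", " ", text).strip()
```

Because the released Text_Clean column was produced in Excel, output from this sketch may differ from it on edge cases such as tabs or non-breaking spaces.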
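
The privacy note lists the masking rules but not the regular expressions behind them. The following is a minimal sketch of three of those rules, assuming Bangladeshi mobile numbers of the form 01XXXXXXXXX with an optional +880 or 880 prefix. All patterns and the name mask_pii are illustrative assumptions rather than the dataset's own code. Transaction IDs are left out because their format is not specified.

```python
import re

# Assumed patterns; the dataset's real expressions are not published in the description.
URL_RE = re.compile(r"(https?://[^\s/]+)/\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
# Prefix (+880, 880 or 0), then "1" and four digits are kept; the last five digits are masked.
# In Python 3, \d also matches Bangla digits in str patterns.
PHONE_RE = re.compile(r"(?<!\d)((?:\+?880|0)1\d{4})\d{5}(?!\d)")


def mask_pii(text: str) -> str:
    """Apply URL, email, and phone masking in that order."""
    text = URL_RE.sub(r"\1/XXXXX", text)    # keep the domain, mask the path
    text = EMAIL_RE.sub(r"XXXXX@\1", text)  # mask the local part only
    text = PHONE_RE.sub(r"\1XXXXX", text)   # mask the last five digits
    return text
```

URLs and emails are handled before phone numbers so that digits inside a path or a local part are masked by the more specific rule first.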
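
The class counts stated in section 2 can be checked against a downloaded copy of the file. The sketch below uses only the standard library and assumes that BTTC.csv is UTF-8 encoded and that its header row uses the column names listed above (in particular Label). The helper names load_rows, label_counts, and mismatches are hypothetical.

```python
import csv
from collections import Counter

# Counts as stated in the dataset description (total 10,283).
EXPECTED = {"HAM": 3904, "PROMO": 3695, "SPAM": 2684}


def load_rows(path="BTTC.csv"):
    """Read the CSV into a list of dicts; utf-8-sig tolerates an optional BOM."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


def label_counts(rows):
    """Count messages per label, ignoring case and surrounding whitespace."""
    return Counter(row["Label"].strip().upper() for row in rows)


def mismatches(counts):
    """Return {label: (expected, found)} for labels whose counts differ."""
    labels = set(EXPECTED) | set(counts)
    return {
        label: (EXPECTED.get(label, 0), counts.get(label, 0))
        for label in labels
        if EXPECTED.get(label, 0) != counts.get(label, 0)
    }
```

With the file in the working directory, mismatches(label_counts(load_rows())) returns an empty dict when the file matches the documented distribution.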
Created:
2026-03-05