notd5a/malicious-benign-sms-mms-dataset

Name: notd5a/malicious-benign-sms-mms-dataset
Creator: notd5a
Published: 2026-03-21 18:56:40
License: cc-by-nc-4.0

Hugging Face · Updated 2026-03-21 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/notd5a/malicious-benign-sms-mms-dataset


Resource overview:

---
license: cc-by-nc-4.0
task_categories:
- text-classification
language:
- en
tags:
- spam
- smishing
- benign
- malicious
- classification
size_categories:
- 100K<n<1M
---

# Dataset v3 Changelog

Changes from dataset v2 (`model_datasets/v4/`) to v3 (`model_datasets/v2-4/`).

## Summary

v3 is a curated, rebalanced, and feature-enriched derivative of v2. The goal was to improve training signal quality by removing noisy examples, correcting mislabelled data, fixing the short-message class imbalance, and adding 23 engineered text features.

| | v2 | v3 (base) | v3 (DeBERTa) |
|---|---|---|---|
| **File** | `dataset_v4_dual_cleaned_v2.csv` | `dataset_v2.4.csv` | `dataset_v2-4_deberta.csv` |
| **Total rows** | 777,480 | 442,282 | 268,340 |
| **Benign** | 700,364 | 365,406 | 194,504 |
| **Spam** | 77,116 | 76,876 | 73,836 |
| **Benign:Spam** | 9.1:1 | 4.8:1 | 2.6:1 |
| **AI-generated** | 68,569 | 59,035 | 49,316 |
| **Engineered features** | 0 | 0 | 23 |

v3 exists in the following stages:

- **`dataset_v3.csv`**: cleaned and pruned base (442k rows, 3 columns)
- **`dataset_v3_for_deberta.csv`**: length-stratified undersampled + enriched, used for DeBERTa training (268k rows, 26 columns)

---

## Changes applied

### 1. Benign pruning: 334,822 messages removed

Removed excess benign messages to reduce the 9.1:1 class imbalance to 4.8:1. The v2 dataset was heavily skewed toward short benign messages: 86.5% of benign messages were <=80 characters while 85% of spam messages were >80 characters, making message length a trivial proxy for the label. Pruning targeted short benign messages disproportionately to flatten the length distribution across classes.

### 2. AI-benign cleanup: 10,890 AI-generated benign messages removed

v2 contained 31,368 benign messages incorrectly marked `ai_generated=1`. These were synthetic bank notifications, appointment reminders, and carrier messages that were labelled as AI-generated during data augmentation but are actually benign content.
The v2 changelog had already corrected 5,255 of these by setting `ai_generated=0`, but 10,890 remained. v3 drops these entirely to avoid training noise: the model should not learn that benign service messages are AI-generated.

| | v2 | v3 (base) |
|---|---|---|
| AI-generated benign | 31,368 | 20,473 |
| AI-generated spam | 37,201 | 38,562 |

### 3. New AI-generated spam: 1,617 messages added

Added 1,617 new AI-generated smishing messages not present in any v2 file. These are targeted spear-smishing examples covering international banking scams (SBI, Nedbank, etc.), obfuscated URLs with Unicode substitution, and trading/investment lures.

### 4. Label corrections: 148 spam labels + 16 AI labels

**Spam label corrections (148):**

- 142 messages changed from benign (0) to spam (1): these had phishing signals (URLs, urgency language, impersonation) but were mislabelled as benign
- 6 messages changed from spam (1) to benign (0): legitimate service notifications incorrectly flagged as spam

**AI label corrections (16):**

- 16 messages changed from human (0) to AI-generated (1)

### 5. Label audit: 9,799 flagged entries

A systematic audit (`label_audit_report.csv`) was run to identify remaining label quality issues. The audit flags messages that may need review but does not automatically change labels.

| Flag | Count | Description |
|---|---|---|
| Truncated message | 8,956 | Message appears to start mid-sentence or ends with a function word, suggesting it was cut during data collection |
| Customer service message | 456 | Spam-labelled message that looks like a legitimate commercial notification |
| Legitimate brand name | 112 | Contains real retail/service provider names with legitimate offers |
| Gaming discussion | 81 | Non-spam chat content about games |
| Legitimate transaction alert | 80 | Real bank/payment notifications |
| Brand impersonation | 27 | Impersonations of real brands (correctly labelled as spam) |
| Urgency/coercion language | 11 | Messages with manipulative pressure tactics |

Most flagged entries (91%) are truncated messages. These remain in the dataset as they still carry signal, but the audit identifies them for potential future cleanup.

### 6. Short message removal: all messages <30 chars dropped

The v3 DeBERTa dataset removes all messages shorter than 30 characters. In the v3 base dataset, the 0-30 char bin contained 42,392 messages, of which only 850 were spam: a roughly 49:1 local ratio that degrades the classifier's ability to learn from short text. The per-bin counts below sum to the v3 base and v3 DeBERTa totals.

| Length bin | v3 base benign | v3 base spam | v3 DeBERTa benign | v3 DeBERTa spam |
|---|---|---|---|---|
| 0-30 | 41,542 | 850 | 0 | 0 |
| 30-60 | 160,284 | 3,215 | 30,924 | 1,025 |
| 60-100 | 106,612 | 11,933 | 106,612 | 11,933 |
| 100-160 | 42,815 | 45,515 | 42,815 | 45,515 |
| 160-256 | 11,199 | 12,316 | 11,199 | 12,316 |
| 256+ | 2,954 | 3,047 | 2,954 | 3,047 |

Short messages (<=60 chars) are now handled by a dedicated CharCNN model in the hybrid routing system, so the DeBERTa training set no longer needs them.

### 7. Length-stratified undersampling: 442,282 to 268,340

The 30-60 char benign bin was still overrepresented (160,284 benign vs 3,215 spam = 50:1). Length-stratified undersampling caps the benign:spam ratio within each length bin, bringing the overall ratio from 4.8:1 down to 2.6:1.
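
The idea of capping the benign:spam ratio per length bin can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the script used to build the dataset: the function names, the `max_ratio` value, and the choice to leave bins without spam untouched are all assumptions. Only the bin edges and the 30-character minimum come from the changelog.

```python
import bisect
import random
from collections import defaultdict

# Lower edges of the length bins used in the changelog: 0-30, 30-60, ..., 256+.
BIN_EDGES = [0, 30, 60, 100, 160, 256]


def length_bin(text):
    """Return the index of the [lower, upper) length bin that contains len(text)."""
    return bisect.bisect_right(BIN_EDGES, len(text)) - 1


def stratified_undersample(rows, max_ratio=10.0, min_chars=30, seed=0):
    """Drop short messages, then cap benign:spam inside each length bin.

    rows      : list of (message, label) tuples, label 0 = benign, 1 = spam
    max_ratio : hypothetical cap; the value used for the real dataset is not documented
    min_chars : messages shorter than this are removed entirely
    """
    rng = random.Random(seed)
    by_bin = defaultdict(lambda: {0: [], 1: []})
    for message, label in rows:
        if len(message) < min_chars:
            continue
        by_bin[length_bin(message)][label].append((message, label))

    kept = []
    for idx in sorted(by_bin):
        benign, spam = by_bin[idx][0], by_bin[idx][1]
        limit = int(max_ratio * len(spam))
        # Assumption: a bin with no spam has no defined ratio, so it is left as-is.
        if spam and len(benign) > limit:
            benign = rng.sample(benign, limit)
        kept.extend(benign)
        kept.extend(spam)
    return kept
```

Spam rows are never dropped in this sketch, whereas the published DeBERTa set also loses some spam in the 30-60 bin, so the real procedure evidently applies additional rules.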

| | v3 base | v3 DeBERTa |
|---|---|---|
| Total | 442,282 | 268,340 |
| Benign | 365,406 | 194,504 |
| Spam | 76,876 | 73,836 |
| Benign:Spam | 4.8:1 | 2.6:1 |
| Benign dropped | n/a | 170,902 |
| Spam dropped | n/a | 3,040 |

### 8. Feature enrichment: 23 engineered text features

All v3 files ending in `_enriched` or `_deberta` have 23 handcrafted features appended as additional columns. These are computed by `data_preprocessing_scripts/data_enrichment.py` and standardised with a fitted `StandardScaler` saved as `scaler.pkl`.

**Original 15 features:**

| Feature | Type | Description |
|---|---|---|
| `char_count` | int | Total character count |
| `word_count` | int | Total word count |
| `avg_word_length` | float | Mean word length |
| `uppercase_ratio` | float | Uppercase letters / all letters |
| `digit_ratio` | float | Digits / total characters |
| `special_char_ratio` | float | Non-alphanumeric, non-space / total characters |
| `exclamation_count` | int | Count of `!` |
| `question_mark_count` | int | Count of `?` |
| `has_url` | binary | Contains URL pattern |
| `url_count` | int | Number of URLs detected |
| `has_shortened_url` | binary | Contains bit.ly, t.co, etc. |
| `has_phone_number` | binary | Contains phone number (>=7 digits) |
| `has_email` | binary | Contains email address |
| `has_currency` | binary | Contains currency symbol or code |
| `urgency_score` | int | Count of urgency keywords matched |

**Evasion detection features (8 new):**

| Feature | Type | Description |
|---|---|---|
| `unicode_ratio` | float | Non-ASCII characters / total characters |
| `char_entropy` | float | Shannon entropy over character distribution |
| `suspicious_spacing` | int | Count of spaced-out word patterns (e.g. "w o r d") |
| `leet_ratio` | float | Characters that map to leet translations / total |
| `max_digit_run` | int | Longest consecutive digit sequence |
| `repeated_char_ratio` | float | Consecutive repeated chars / (length - 1) |
| `vocab_richness` | float | Unique words / total words |
| `has_obfuscated_url` | binary | Detects evasive URL patterns |

---

## Files

| File | Rows | Columns | Description |
|---|---|---|---|
| `dataset_v3.csv` | 442,282 | 3 | Cleaned and pruned base dataset (message, label, ai_generated) |
| `dataset_v3_for_deberta.csv` | 268,340 | 26 | Length-stratified undersampled + enriched, **used for DeBERTa training** |
| `original_dataset_v2.csv` | 813,546 | 3 | Cleaned original dataset |
| `original_dataset_v2_undersampled_stratified.csv` | 376,851 | 3 | Cleaned original dataset with random undersampling and length stratification applied |

## Column schema

```
message       : SMS/MMS message text (string)
label         : 0 = benign, 1 = spam/smishing
ai_generated  : 0 = human-written, 1 = AI-generated
char_count    : (enriched only) int
word_count    : (enriched only) int
...           : 21 more engineered features (see above)
```

---

## Data sources

The dataset is built from three primary collections merged via `data_preprocessing_scripts/merge_csvs.py`, supplemented by public datasets for benign diversity and LLM-generated synthetic spam.

### Primary sources

**1. Discord messages** (~745,400 messages): benign conversational text

Private collection of Discord server exports in JSONL format, extracted and cleaned via `data_preprocessing_scripts/discord_preprocessor.py`. Provides the majority of benign conversational messages. Discord-specific syntax (mentions, custom emoji, invite links, tokens) is stripped during preprocessing.

**2. Combined public SMS datasets** (~74,900 messages): ham + spam labelled

Merged from three public sources into `cleaned_extern_data.csv`:

- **UCI SMS Spam Collection**: 5,574 messages (4,827 ham + 747 spam).
  The foundational SMS spam benchmark.
  - URL: https://archive.ics.uci.edu/dataset/228/sms+spam+collection
  - HuggingFace mirror: https://huggingface.co/datasets/ucirvine/sms_spam
  - Citation: Almeida, T.A., Gomez Hidalgo, J.M., Yamakami, A. (2011). *Contributions to the Study of SMS Spam Filtering: New Collection and Results.* Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11). DOI: [10.24432/C5CC84](https://doi.org/10.24432/C5CC84)
- **mshenoda/spam-messages**: Merged compilation of SMS Spam Collection, Telegram Spam Ham, and Enron Spam datasets.
  - URL: https://huggingface.co/datasets/mshenoda/spam-messages
  - Sub-sources:
    - [thehamkercat/telegram-spam-ham](https://huggingface.co/datasets/thehamkercat/telegram-spam-ham): Telegram message spam/ham classification
    - [SetFit/enron_spam](https://huggingface.co/datasets/SetFit/enron_spam): Spam/ham subset of the Enron email corpus (originally released by the Federal Energy Regulatory Commission, processed by CMU)
- **vinit9638/SMS-scam-detection-dataset**: 138,813 multilingual text entries. Filtered to English ham messages only.
  - URL: https://github.com/vinit9638/SMS-scam-detection-dataset

**3. SpamDam Twitter data** (~36,700 messages): SMS-like spam from social media

Spam and ham messages collected from Twitter/X, providing social media-style short text patterns.

- Citation: Li, Y., Zhang, R., Rong, W., Mi, X. (2024). *SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection.* arXiv: [2404.09481](https://arxiv.org/abs/2404.09481)
- Project page: https://chasesecurity.github.io/SpamDam/

### Supplementary public datasets

Used for benign diversity sourcing (via `source_samples.py`) to reduce false positives on legitimate service messages:

- **NUS SMS Corpus**: ~67,000 conversational SMS messages from National University of Singapore students. All benign.
  - URL: https://github.com/WING-NUS/nus-sms-corpus
  - Kaggle mirror: https://www.kaggle.com/datasets/rtatman/the-national-university-of-singapore-sms-corpus
  - Citation: Chen, T. and Kan, M.-Y. (2013). *Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus.* Language Resources and Evaluation, 47(2), 299-355. DOI: [10.1007/s10579-012-9197-9](https://doi.org/10.1007/s10579-012-9197-9)
- **Mendeley SMS Phishing Dataset**: 5,971 messages (4,844 ham + 489 spam + 638 smishing). Ham messages used for benign diversity.
  - URL: https://data.mendeley.com/datasets/f45bkkt8pr/1
  - Citation: Mishra, S. and Soni, D. (2022). *SMS Phishing Dataset for Machine Learning and Pattern Recognition.* Mendeley Data, V1. DOI: [10.17632/f45bkkt8pr.1](https://doi.org/10.17632/f45bkkt8pr.1)

### Synthetic data

- **LLM-generated smishing**: AI-generated spam produced via a three-phase pipeline (`data_augment.py`) cycling through multiple local LLMs (Llama, Mistral, Qwen, Gemma, Granite, and others):
  - Phase 1: Paraphrasing existing human spam with varied prompts
  - Phase 2: Spear-smishing from person profiles x attack scenario templates
  - Phase 3: Style-twist (benign messages transformed into smishing)
- **LLM-generated hard negatives**: Synthetic benign messages generated via Claude (`source_samples.py`) targeting false positive categories identified in error analysis: retail promotions, bank transaction alerts, telecom notifications, delivery confirmations, appointment reminders, and OTP/verification codes.
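
Several of the evasion detection features listed under Feature enrichment are defined precisely enough in the feature table to sketch directly. The functions below follow those one-line descriptions using only the standard library. They are an illustration, not the contents of `data_enrichment.py`: the log base (2), lowercased whitespace tokenisation, ASCII-only digit matching, and the zero defaults for empty input are assumptions, and the `StandardScaler` step is omitted.

```python
import math
import re
from collections import Counter


def unicode_ratio(text):
    """Non-ASCII characters / total characters."""
    return sum(ord(c) > 127 for c in text) / len(text) if text else 0.0


def char_entropy(text):
    """Shannon entropy (in bits, an assumed base) of the character distribution."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def max_digit_run(text):
    """Length of the longest run of consecutive ASCII digits."""
    return max((len(run) for run in re.findall(r"[0-9]+", text)), default=0)


def repeated_char_ratio(text):
    """Adjacent character pairs that are identical / (length - 1)."""
    if len(text) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(text, text[1:]))
    return repeats / (len(text) - 1)


def vocab_richness(text):
    """Unique words / total words, on lowercased whitespace-separated tokens."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def evasion_features(text):
    """Collect the sketched features under the column names from the feature table."""
    return {
        "unicode_ratio": unicode_ratio(text),
        "char_entropy": char_entropy(text),
        "max_digit_run": max_digit_run(text),
        "repeated_char_ratio": repeated_char_ratio(text),
        "vocab_richness": vocab_richness(text),
    }
```

Calling `evasion_features(message)` on each row and standardising the resulting columns would approximate part of the enriched schema. The remaining features (`suspicious_spacing`, `leet_ratio`, `has_obfuscated_url`) depend on pattern lists that the changelog does not specify, so they are left out.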

Provider:

notd5a
