notd5a/malicious-benign-sms-mms-dataset
Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/notd5a/malicious-benign-sms-mms-dataset
Dataset description:
---
license: cc-by-nc-4.0
task_categories:
- text-classification
language:
- en
tags:
- spam
- smishing
- benign
- malicious
- classification
size_categories:
- 100K<n<1M
---
# Dataset v3 Changelog
Changes from dataset v2 (`model_datasets/v4/`) to v3 (`model_datasets/v2-4/`).
## Summary
v3 is a curated, rebalanced, and feature-enriched derivative of v2. The goal was to improve training signal quality by removing noisy examples, correcting mislabelled data, fixing the short-message class imbalance, and adding 23 engineered text features.
| | v2 | v3 (base) | v3 (DeBERTa) |
|---|---|---|---|
| **File** | `dataset_v4_dual_cleaned_v2.csv` | `dataset_v2.4.csv` | `dataset_v2-4_deberta.csv` |
| **Total rows** | 777,480 | 442,282 | 268,340 |
| **Benign** | 700,364 | 365,406 | 194,504 |
| **Spam** | 77,116 | 76,876 | 73,836 |
| **Benign:Spam** | 9.1:1 | 4.8:1 | 2.6:1 |
| **AI-generated** | 68,569 | 59,035 | 49,316 |
| **Engineered features** | 0 | 0 | 23 |
v3 exists in two stages:
- **`dataset_v3.csv`**: cleaned and pruned base (442k rows, 3 columns)
- **`dataset_v3_for_deberta.csv`**: length-stratified undersampled + enriched, used for DeBERTa training (268k rows, 26 columns)
---
## Changes applied
### 1. Benign pruning — 334,822 messages removed
Removed excess benign messages to reduce the 9.1:1 class imbalance down to 4.8:1. The v2 dataset was heavily skewed toward short benign messages — 86.5% of benign was <=80 characters while 85% of spam was >80 characters, making message length a trivial proxy for the label.
Pruning targeted short benign messages disproportionately to flatten the length distribution across classes.
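
To check how strongly length alone separates the classes, one can compute, per label, the share of messages at or below a character threshold. The following is a minimal sketch, not the authors' script: the function name `share_at_or_below`, the `(message, label)` tuple layout, and the toy rows are assumptions for illustration; only the 80-character threshold comes from the text above.

```python
from collections import defaultdict


def share_at_or_below(rows, threshold=80):
    """Return {label: fraction of messages with len(message) <= threshold}."""
    totals = defaultdict(int)
    short = defaultdict(int)
    for message, label in rows:
        totals[label] += 1
        if len(message) <= threshold:
            short[label] += 1
    return {label: short[label] / totals[label] for label in totals}


# Toy rows (message, label); these are NOT taken from the dataset.
toy_rows = [
    ("ok see you soon", 0),
    ("lol", 0),
    ("x" * 120, 0),
    ("y" * 150, 1),
    ("win now", 1),
]
shares = share_at_or_below(toy_rows)
```

If the per-label shares differ sharply, as the 86.5% and 85% figures above suggest for v2, a classifier can lean on length instead of content.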
### 2. AI-benign cleanup — 10,890 AI-generated benign messages removed
v2 contained 31,368 benign messages incorrectly marked `ai_generated=1`. These were synthetic bank notifications, appointment reminders, and carrier messages that were labelled as AI-generated during data augmentation but are actually benign content. The v2 changelog had already corrected 5,255 of these by setting `ai_generated=0`, but 10,890 remained.
v3 drops these entirely to avoid training noise — the model should not learn that benign service messages are AI-generated.
| | v2 | v3 (base) |
|---|---|---|
| AI-generated benign | 31,368 | 20,473 |
| AI-generated spam | 37,201 | 38,562 |
### 3. New AI-generated spam — 1,617 messages added
Added 1,617 new AI-generated smishing messages not present in any v2 file. These are targeted spear-smishing examples covering international banking scams (SBI, Nedbank, etc.), obfuscated URLs with Unicode substitution, and trading/investment lures.
### 4. Label corrections — 148 spam labels + 16 AI labels
**Spam label corrections (148):**
- 142 messages changed from benign (0) to spam (1) — these had phishing signals (URLs, urgency language, impersonation) but were mislabelled as benign
- 6 messages changed from spam (1) to benign (0) — legitimate service notifications incorrectly flagged as spam
**AI label corrections (16):**
- 16 messages changed from human (0) to AI-generated (1)
### 5. Label audit — 9,799 flagged entries
A systematic audit (`label_audit_report.csv`) was run to identify remaining label quality issues. The audit flags messages that may need review but does not automatically change labels.
| Flag | Count | Description |
|---|---|---|
| Truncated message | 8,956 | Message appears to start mid-sentence or ends with a function word, suggesting it was cut during data collection |
| Customer service message | 456 | Spam-labelled message that looks like a legitimate commercial notification |
| Legitimate brand name | 112 | Contains real retail/service provider names with legitimate offers |
| Gaming discussion | 81 | Non-spam chat content about games |
| Legitimate transaction alert | 80 | Real bank/payment notifications |
| Brand impersonation | 27 | Impersonations of real brands (correctly labelled as spam) |
| Urgency/coercion language | 11 | Messages with manipulative pressure tactics |
Most flagged entries (91%) are truncated messages. These remain in the dataset as they still carry signal, but the audit identifies them for potential future cleanup.
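
The "ends with a function word" part of the truncation flag can be approximated with a very small heuristic. This is a hedged sketch only: the word list `FUNCTION_WORDS` and the function `looks_truncated` are hypothetical, and the actual audit logic behind `label_audit_report.csv` is not described in enough detail here to reproduce.

```python
# Hypothetical, deliberately short list of function words.
FUNCTION_WORDS = {
    "the", "a", "an", "to", "of", "and", "or", "in", "on", "for", "with", "at",
}


def looks_truncated(message):
    """Flag a message whose final token is a bare function word.

    A token with trailing punctuation (e.g. "to.") is not flagged, on the
    assumption that punctuation marks a deliberate sentence end.
    """
    words = message.strip().split()
    if not words:
        return False
    return words[-1].lower() in FUNCTION_WORDS
```

Like the audit described above, a check of this kind only flags candidates for review and does not change any labels.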
### 6. Short message removal — all messages <30 chars dropped
The v3 DeBERTa dataset removes all messages shorter than 30 characters. In v2, the 0-30 char bin contained 42,392 benign messages but only 850 spam, a roughly 50:1 local ratio that degrades the classifier's ability to learn from short text.
| Length bin | v2 benign | v2 spam | v3 DeBERTa benign | v3 DeBERTa spam |
|---|---|---|---|---|
| 0-30 | 42,392 | 850 | 0 | 0 |
| 30-60 | 160,284 | 3,215 | 30,924 | 1,025 |
| 60-100 | 106,612 | 11,933 | 106,612 | 11,933 |
| 100-160 | 42,815 | 45,515 | 42,815 | 45,515 |
| 160-256 | 11,199 | 12,316 | 11,199 | 12,316 |
| 256+ | 2,954 | 3,047 | 2,954 | 3,047 |
Short messages (<=60 chars) are now handled by a dedicated CharCNN model in the hybrid routing system, so the DeBERTa training set no longer needs them.
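
The length bins and the short-message filter can be sketched as follows. The bin edges are taken from the table above; treating each bin as half-open (`[30, 60)`, `[60, 100)`, and so on) is an assumption, chosen so that a 30-character message survives the "<30 chars dropped" rule. The names `length_bin` and `drop_short` are illustrative, not from the project's scripts.

```python
from bisect import bisect_right

BIN_EDGES = [30, 60, 100, 160, 256]
BIN_LABELS = ["0-30", "30-60", "60-100", "100-160", "160-256", "256+"]


def length_bin(message):
    """Map a message to its length-bin label (lower edge inclusive)."""
    return BIN_LABELS[bisect_right(BIN_EDGES, len(message))]


def drop_short(messages, min_chars=30):
    """Keep only messages with at least min_chars characters."""
    return [m for m in messages if len(m) >= min_chars]
```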
### 7. Length-stratified undersampling — 442,282 to 268,340
The 30-60 char benign bin was still overrepresented (160,284 benign vs 3,215 spam = 50:1). Length-stratified undersampling caps the benign:spam ratio within each length bin, bringing the overall ratio from 4.8:1 down to 2.6:1.
| | v3 base | v3 DeBERTa |
|---|---|---|
| Total | 442,282 | 268,340 |
| Benign | 365,406 | 194,504 |
| Spam | 76,876 | 73,836 |
| Benign:Spam | 4.8:1 | 2.6:1 |
| Benign dropped | — | 170,902 |
| Spam dropped | — | 3,040 |
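
One way to cap the benign:spam ratio within each length bin is sketched below. This is an illustration under stated assumptions, not the project's implementation: the cap value `max_ratio` is hypothetical (the actual per-bin cap is not given here), spam is left untouched (the real pipeline also dropped some spam), and a bin with no spam loses all of its benign messages.

```python
import random
from collections import defaultdict


def undersample_by_length(rows, bin_of, max_ratio, seed=0):
    """Cap benign (label 0) at max_ratio * spam (label 1) inside each bin.

    rows   : iterable of (message, label) tuples with label in {0, 1}
    bin_of : function mapping a message to a bin key
    """
    rng = random.Random(seed)
    bins = defaultdict(lambda: {0: [], 1: []})
    for message, label in rows:
        bins[bin_of(message)][label].append((message, label))

    kept = []
    for groups in bins.values():
        spam, benign = groups[1], groups[0]
        cap = int(max_ratio * len(spam))  # assumption: zero spam gives cap 0
        if len(benign) > cap:
            benign = rng.sample(benign, cap)
        kept.extend(spam)
        kept.extend(benign)
    return kept


# Toy data: a "short" bin with 10 benign / 2 spam, a "long" bin with 3 / 3.
toy_rows = (
    [("b%d" % i, 0) for i in range(10)]
    + [("s%d" % i, 1) for i in range(2)]
    + [("benign long %d" % i, 0) for i in range(3)]
    + [("spam long %d!!" % i, 1) for i in range(3)]
)
kept = undersample_by_length(
    toy_rows, lambda m: "short" if len(m) < 10 else "long", max_ratio=2
)
```

With `max_ratio=2`, the short bin keeps 4 of its 10 benign messages while the long bin, already under the cap, is left unchanged.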
### 8. Feature enrichment — 23 engineered text features
All v3 files ending in `_enriched` or `_deberta` have 23 handcrafted features appended as additional columns. These are computed by `data_preprocessing_scripts/data_enrichment.py` and standardised with a fitted `StandardScaler` saved as `scaler.pkl`.
**Original 15 features:**
| Feature | Type | Description |
|---|---|---|
| `char_count` | int | Total character count |
| `word_count` | int | Total word count |
| `avg_word_length` | float | Mean word length |
| `uppercase_ratio` | float | Uppercase letters / all letters |
| `digit_ratio` | float | Digits / total characters |
| `special_char_ratio` | float | Non-alphanumeric, non-space / total characters |
| `exclamation_count` | int | Count of `!` |
| `question_mark_count` | int | Count of `?` |
| `has_url` | binary | Contains URL pattern |
| `url_count` | int | Number of URLs detected |
| `has_shortened_url` | binary | Contains bit.ly, t.co, etc. |
| `has_phone_number` | binary | Contains phone number (>=7 digits) |
| `has_email` | binary | Contains email address |
| `has_currency` | binary | Contains currency symbol or code |
| `urgency_score` | int | Count of urgency keywords matched |
**Evasion detection features (8 new):**
| Feature | Type | Description |
|---|---|---|
| `unicode_ratio` | float | Non-ASCII characters / total characters |
| `char_entropy` | float | Shannon entropy over character distribution |
| `suspicious_spacing` | int | Count of spaced-out word patterns (e.g. "w o r d") |
| `leet_ratio` | float | Characters that map to leet translations / total |
| `max_digit_run` | int | Longest consecutive digit sequence |
| `repeated_char_ratio` | float | Consecutive repeated chars / (length - 1) |
| `vocab_richness` | float | Unique words / total words |
| `has_obfuscated_url` | binary | Detects evasive URL patterns |
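
A few of the features above can be sketched directly from their table descriptions. This is not the code in `data_enrichment.py`, which may differ; the base-2 logarithm for entropy, whitespace tokenisation, lowercasing before counting unique words, and returning 0 for empty input are all assumptions.

```python
import math
import re
from collections import Counter


def uppercase_ratio(text):
    """Uppercase letters / all letters."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


def char_entropy(text):
    """Shannon entropy (bits) over the character distribution."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((k / n) * math.log2(k / n) for k in Counter(text).values())


def max_digit_run(text):
    """Length of the longest consecutive digit sequence."""
    return max((len(run) for run in re.findall(r"\d+", text)), default=0)


def repeated_char_ratio(text):
    """Adjacent identical character pairs / (length - 1)."""
    if len(text) < 2:
        return 0.0
    return sum(a == b for a, b in zip(text, text[1:])) / (len(text) - 1)


def vocab_richness(text):
    """Unique words / total words (case-insensitive, whitespace split)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0
```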
---
## Files
| File | Rows | Columns | Description |
|---|---|---|---|
| `dataset_v3.csv` | 442,282 | 3 | Cleaned and pruned base dataset (message, label, ai_generated) |
| `dataset_v3_for_deberta.csv` | 268,340 | 26 | Length-stratified undersampled + enriched — **used for DeBERTa training** |
| `original_dataset_v2.csv` | 813,546 | 3 | Cleaned original dataset |
| `original_dataset_v2_undersampled_stratified.csv` | 376,851 | 3 | Cleaned original dataset with random undersampling and length stratification applied |
## Column schema
```
message — SMS/MMS message text (string)
label — 0 = benign, 1 = spam/smishing
ai_generated — 0 = human-written, 1 = AI-generated
char_count — (enriched only) int
word_count — (enriched only) int
... — 21 more engineered features (see above)
```
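
A minimal sketch of reading rows under this schema with the standard library, assuming each CSV has a header row that uses the column names listed above (the files themselves were not inspected for this example). The helper `read_rows` and the two sample rows are made up for illustration.

```python
import csv
import io


def read_rows(handle):
    """Yield row dicts with label and ai_generated converted to int."""
    for row in csv.DictReader(handle):
        row["label"] = int(row["label"])
        row["ai_generated"] = int(row["ai_generated"])
        yield row


# In-memory stand-in for a base file; real usage would open the CSV instead.
sample = io.StringIO(
    "message,label,ai_generated\n"
    '"Your parcel is held, pay the fee at hxxp://example.test",1,1\n'
    "see you at 6,0,0\n"
)
rows = list(read_rows(sample))
```

For the enriched files, the 23 feature columns would arrive as strings in the same way and need their own numeric conversion.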
---
## Data sources
The dataset is built from three primary collections merged via `data_preprocessing_scripts/merge_csvs.py`, supplemented by public datasets for benign diversity and LLM-generated synthetic spam.
### Primary sources
**1. Discord messages** (~745,400 messages) — benign conversational text
Private collection of Discord server exports in JSONL format, extracted and cleaned via `data_preprocessing_scripts/discord_preprocessor.py`. Provides the majority of benign conversational messages. Discord-specific syntax (mentions, custom emoji, invite links, tokens) is stripped during preprocessing.
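
The stripping step might look roughly like the sketch below. It is not the code in `discord_preprocessor.py`: the regular expressions are assumptions based on Discord's usual markup for mentions (`<@123>`, `<@!123>`, `<@&123>`, `<#123>`), custom emoji (`<:name:123>`, `<a:name:123>`), and invite links, and token removal is left out because the source does not say what form those tokens take.

```python
import re

# Assumed patterns for common Discord markup; adjust to the real export format.
MENTION = re.compile(r"<(?:@[!&]?|#)\d+>")
CUSTOM_EMOJI = re.compile(r"<a?:\w+:\d+>")
INVITE = re.compile(
    r"(?:https?://)?(?:www\.)?"
    r"(?:discord\.gg|discord(?:app)?\.com/invite)/[A-Za-z0-9-]+"
)


def strip_discord_syntax(text):
    """Remove mentions, custom emoji, and invite links, then tidy whitespace."""
    for pattern in (MENTION, CUSTOM_EMOJI, INVITE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

For example, `strip_discord_syntax("hey <@!1234> join https://discord.gg/abc123 <:pog:42> now")` returns `"hey join now"`.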
**2. Combined public SMS datasets** (~74,900 messages) — ham + spam labelled
Merged from three public sources into `cleaned_extern_data.csv`:
- **UCI SMS Spam Collection**: 5,574 messages (4,827 ham + 747 spam). The foundational SMS spam benchmark.
  - URL: https://archive.ics.uci.edu/dataset/228/sms+spam+collection
  - HuggingFace mirror: https://huggingface.co/datasets/ucirvine/sms_spam
  - Citation: Almeida, T.A., Gomez Hidalgo, J.M., Yamakami, A. (2011). *Contributions to the Study of SMS Spam Filtering: New Collection and Results.* Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11). DOI: [10.24432/C5CC84](https://doi.org/10.24432/C5CC84)
- **mshenoda/spam-messages**: Merged compilation of SMS Spam Collection, Telegram Spam Ham, and Enron Spam datasets.
  - URL: https://huggingface.co/datasets/mshenoda/spam-messages
  - Sub-sources:
    - [thehamkercat/telegram-spam-ham](https://huggingface.co/datasets/thehamkercat/telegram-spam-ham): Telegram message spam/ham classification
    - [SetFit/enron_spam](https://huggingface.co/datasets/SetFit/enron_spam): Spam/ham subset of the Enron email corpus (originally released by the Federal Energy Regulatory Commission, processed by CMU)
- **vinit9638/SMS-scam-detection-dataset**: 138,813 multilingual text entries. Filtered to English ham messages only.
  - URL: https://github.com/vinit9638/SMS-scam-detection-dataset
**3. SpamDam Twitter data** (~36,700 messages) — SMS-like spam from social media
Spam and ham messages collected from Twitter/X, providing social media-style short text patterns.
- Citation: Li, Y., Zhang, R., Rong, W., Mi, X. (2024). *SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection.* arXiv: [2404.09481](https://arxiv.org/abs/2404.09481)
- Project page: https://chasesecurity.github.io/SpamDam/
### Supplementary public datasets
Used for benign diversity sourcing (via `source_samples.py`) to reduce false positives on legitimate service messages:
- **NUS SMS Corpus**: ~67,000 conversational SMS messages from National University of Singapore students. All benign.
  - URL: https://github.com/WING-NUS/nus-sms-corpus
  - Kaggle mirror: https://www.kaggle.com/datasets/rtatman/the-national-university-of-singapore-sms-corpus
  - Citation: Chen, T. and Kan, M.-Y. (2013). *Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus.* Language Resources and Evaluation, 47(2), 299-335. DOI: [10.1007/s10579-012-9197-9](https://doi.org/10.1007/s10579-012-9197-9)
- **Mendeley SMS Phishing Dataset**: 5,971 messages (4,844 ham + 489 spam + 638 smishing). Ham messages used for benign diversity.
  - URL: https://data.mendeley.com/datasets/f45bkkt8pr/1
  - Citation: Mishra, S. and Soni, D. (2022). *SMS Phishing Dataset for Machine Learning and Pattern Recognition.* Mendeley Data, V1. DOI: [10.17632/f45bkkt8pr.1](https://doi.org/10.17632/f45bkkt8pr.1)
### Synthetic data
- **LLM-generated smishing**: AI-generated spam produced via a three-phase pipeline (`data_augment.py`) cycling through multiple local LLMs (Llama, Mistral, Qwen, Gemma, Granite, and others):
  - Phase 1: Paraphrasing existing human spam with varied prompts
  - Phase 2: Spear-smishing from person profiles × attack scenario templates
  - Phase 3: Style-twist, in which benign messages are transformed into smishing
- **LLM-generated hard negatives** — Synthetic benign messages generated via Claude (`source_samples.py`) targeting false positive categories identified in error analysis: retail promotions, bank transaction alerts, telecom notifications, delivery confirmations, appointment reminders, and OTP/verification codes.
Provided by: notd5a



