LHRS-UM-FERI/MENTHOS-dataset-spam
Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/LHRS-UM-FERI/MENTHOS-dataset-spam
Resource description:
---
language:
- en
- sl
tags:
- menthos
- spam
- phishing
- binary-classification
size_categories:
- 10K<n<100K
---
# MENTHOS-dataset-spam
## English
### About
MENTHOS-Spam is a binary text classification dataset for spam/phishing detection. It combines an email phishing source with an SMS spam source and divides the result into train, validation, and test splits.
### Source Data
Original sources:
- Phishing Email Dataset (Kaggle): https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset
- SMS Spam Collection (Kaggle mirror of UCI dataset): https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
### Processing and Balancing
- Unified schema: `text`, `label`
- Label mapping for SMS source: `ham -> 0`, `spam -> 1`
- Split: 70% train, 15% validation, 15% test
- Balancing: each split is balanced by downsampling the majority class to the minority class count
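
The steps above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the function names, the random shuffle, the fixed seed, and the order of operations (split first, then balance each split) are assumptions, and the email source is assumed to arrive with `0`/`1` labels already.

```python
import random
from collections import defaultdict

# Label mapping stated in the dataset card for the SMS source.
SMS_LABEL_MAP = {"ham": 0, "spam": 1}


def unify_sms(rows):
    """Convert (raw_label, text) pairs into the unified text/label schema."""
    return [{"text": text, "label": SMS_LABEL_MAP[raw]} for raw, text in rows]


def split_70_15_15(records, seed=0):
    """Shuffle a copy, then cut it at 70% / 15% / 15% (assumed: random, unstratified)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def downsample_to_minority(records, seed=0):
    """Sample every class down to the size of the smallest class."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)
    if not by_label:
        return []
    n_min = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 70 ham and 30 spam messages.
toy = unify_sms(
    [("ham", f"hello {i}") for i in range(70)]
    + [("spam", f"win a prize {i}") for i in range(30)]
)
balanced_splits = {
    name: downsample_to_minority(part)
    for name, part in split_70_15_15(toy).items()
}
```

Balancing after the split means each split ends up with equal class counts on its own, which matches the per-split wording of the card; the final split sizes then depend on how many minority-class rows land in each part.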
### Splits and Class Distribution
The prepared train, validation, and test splits are included with the dataset release.
| split | rows | label 0 | label 1 |
| ---------- | ----: | ------: | ------: |
| train | 60982 | 30491 | 30491 |
| validation | 13166 | 6583 | 6583 |
| test | 13128 | 6564 | 6564 |
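
As a quick sanity check on the table, the short script below confirms that each split's row count equals the sum of its two class counts and reports each split's share of the total. The numbers are copied from the table; the variable names are illustrative only.

```python
# Row counts copied from the split table: (rows, label 0, label 1).
SPLITS = {
    "train": (60982, 30491, 30491),
    "validation": (13166, 6583, 6583),
    "test": (13128, 6564, 6564),
}

total = sum(rows for rows, _, _ in SPLITS.values())

for name, (rows, n0, n1) in SPLITS.items():
    assert rows == n0 + n1  # class counts add up to the row count
    assert n0 == n1         # each split is balanced
    print(f"{name:<10} {rows:>6} rows  {rows / total:.1%} of all rows")
```

The shares come out to roughly 69.9%, 15.1%, and 15.0% of 87,276 rows, close to but not exactly 70/15/15.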
### Citation
```
@misc{borovic_li-dobnik_kranjec_ferme_2026,
title = {MENTHOS-dataset-spam},
author = {Borovic and {Li Dobnik} and Kranjec and Ferme},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/LHRS-UM-FERI/MENTHOS-dataset-spam}}
}
```
Provider:
LHRS-UM-FERI



