
LHRS-UM-FERI/MENTHOS-dataset-spam

Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/LHRS-UM-FERI/MENTHOS-dataset-spam
Resource description:
---
language:
- en
- sl
tags:
- menthos
- spam
- phishing
- binary-classification
size_categories:
- 10K<n<100K
---

# MENTHOS-dataset-spam

## English

### About

MENTHOS-Spam is a binary text classification dataset for spam/phishing detection. It is created by combining email and SMS spam sources and then producing train/validation/test splits.

### Source Data

Original sources:

- Phishing Email Dataset (Kaggle): https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset
- SMS Spam Collection (Kaggle mirror of UCI dataset): https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

### Processing and Balancing

- Unified schema: `text`, `label`
- Label mapping for SMS source: `ham -> 0`, `spam -> 1`
- Split: 70% train, 15% validation, 15% test
- Balancing: each split is balanced by downsampling to the minority class count

### Splits and Class Distribution

The prepared train, validation, and test splits are included with the dataset release.

| split      |  rows | label 0 | label 1 |
| ---------- | ----: | ------: | ------: |
| train      | 60982 |   30491 |   30491 |
| validation | 13166 |    6583 |    6583 |
| test       | 13128 |    6564 |    6564 |

### Citation

```
@misc{borovic_li-dobnik_kranjec_ferme_2026,
  title = {MENTHOS-dataset-spam},
  author = {Borovic, Li Dobnik, Kranjec, Ferme},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/LHRS-UM-FERI/MENTHOS-dataset-spam}}
}
```

---

## Slovenščina

### O datasetu

MENTHOS-Spam je binarni dataset za detekcijo spam/phishing besedil. Nastane z združitvijo e-poštnega in SMS vira ter z gradnjo train/validation/test delitev.

### Izvorni podatki

Izvor:

- Phishing Email Dataset (Kaggle): https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset
- SMS Spam Collection (Kaggle ogledalo UCI): https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

### Obdelava in uravnoteženje

- Enotna shema: `text`, `label`
- Preslikava oznak za SMS vir: `ham -> 0`, `spam -> 1`
- Delitev: 70% train, 15% validation, 15% test
- Uravnoteženje: vsak split je uravnotežen z downsamplingom na manjšinski razred

### Delitve in porazdelitev razredov

Pripravljene train, validation in test delitve so vključene v izdajo nabora podatkov.

| split      | vrstic | label 0 | label 1 |
| ---------- | -----: | ------: | ------: |
| train      |  60982 |   30491 |   30491 |
| validation |  13166 |    6583 |    6583 |
| test       |  13128 |    6564 |    6564 |

### Citiranje

```
@misc{borovic_li-dobnik_kranjec_ferme_2026,
  title = {MENTHOS-dataset-spam},
  author = {Borovic, Li Dobnik, Kranjec, Ferme},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/LHRS-UM-FERI/MENTHOS-dataset-spam}}
}
```
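
The steps listed under "Processing and Balancing" in the dataset card (unified `text`/`label` schema, `ham -> 0` / `spam -> 1` mapping, a 70/15/15 split, and per-split downsampling to the minority class) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the fixed seed, the choice to split before balancing, and the toy input are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import Counter

# Label mapping stated in the dataset card for the SMS source.
SMS_LABELS = {"ham": 0, "spam": 1}


def to_unified(text, raw_label):
    """Map one raw SMS record onto the unified `text`/`label` schema."""
    return {"text": text, "label": SMS_LABELS[raw_label]}


def downsample(rows, rng):
    """Balance rows by sampling every class down to the minority class count."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row["label"], []).append(row)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    rng.shuffle(balanced)
    return balanced


def split_and_balance(rows, seed=0):
    """Shuffle, split 70/15/15, then balance each split independently.

    Assumption: splitting happens before balancing; the card does not
    state the order explicitly.
    """
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n_train = len(rows) * 70 // 100
    n_val = len(rows) * 15 // 100
    parts = {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],
    }
    return {name: downsample(part, rng) for name, part in parts.items()}


# Toy input: 700 ham and 300 spam messages (made-up data, not from the dataset).
raw = [(f"ham message {i}", "ham") for i in range(700)]
raw += [(f"spam message {i}", "spam") for i in range(300)]
splits = split_and_balance(to_unified(t, l) for t, l in raw)

for name, part in splits.items():
    counts = Counter(row["label"] for row in part)
    print(name, len(part), counts[0], counts[1])
```

Because each split is downsampled separately, the final split sizes depend on the minority class count inside each split and no longer follow an exact 70/15/15 ratio, which matches the row counts in the table above being close to, but not exactly, those proportions.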
Provider:
LHRS-UM-FERI