A Balanced Dataset for Spam and Smishing Detection using Large Language Models (LLMs)

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://data.mendeley.com/datasets/vmg875v4xs

Description:

This dataset contains 10,191 labeled SMS messages for training and testing spam and smishing detection machine learning models. A large language model (LLM) was trained to create this dataset.

Structure

This dataset contains five columns:

• LABEL: A categorical value indicating the type of message. The values are:
  o Ham: Benign (non-malicious) message
  o Spam: Unsolicited or junk message
  o Smishing: SMS phishing message to deceive recipients into giving away their sensitive personal information
• TEXT: The content of the message
• URL: Indicates whether a URL is present in the message (Yes/No)
• EMAIL: Indicates whether an email address is present in the message (Yes/No)
• PHONE: Indicates whether a phone number is present in the message (Yes/No)

Key Features

The dataset is balanced to prevent bias in classification tasks:

• ham: 3,397 messages
• spam: 3,397 messages
• smishing: 3,397 messages

Source and Citation

The following publicly available dataset is used for training of the LLM:

Mishra, Sandhya; Soni, Devpriya (2022), “SMS PHISHING DATASET FOR MACHINE LEARNING AND PATTERN RECOGNITION”, Mendeley Data, V1, doi: 10.17632/f45bkkt8pr.1

Use Cases

• Text classification research
• Phishing and fraud detection models
• LLM fine-tuning or prompt engineering for safety and content moderation
• Educational demonstrations in cybersecurity, machine learning (ML) or natural language processing (NLP)
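The column layout and class balance described above can be checked after download with a short script. The following is a minimal sketch, not part of the dataset's own tooling. It assumes the data is a comma-separated file whose header row uses the five column names exactly as listed. The description does not state the file format, so that is an assumption. The helper names `summarize` and `is_balanced` and the three sample rows are hypothetical and invented for illustration only.

```python
import csv
import io
from collections import Counter

# Column names taken from the dataset description; the CSV layout is assumed.
EXPECTED_COLUMNS = ["LABEL", "TEXT", "URL", "EMAIL", "PHONE"]
FLAG_COLUMNS = ["URL", "EMAIL", "PHONE"]


def summarize(csv_file):
    """Check the five-column schema and count messages per label."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    counts = Counter()
    for row in reader:
        # The three indicator columns are described as Yes/No values.
        for col in FLAG_COLUMNS:
            if row[col].strip().lower() not in {"yes", "no"}:
                raise ValueError(f"unexpected {col} value: {row[col]!r}")
        # Normalize case, since the description writes both "Ham" and "ham".
        counts[row["LABEL"].strip().lower()] += 1
    return counts


def is_balanced(counts):
    """True when at least one class exists and all classes are the same size."""
    return len(set(counts.values())) == 1


# Made-up rows standing in for the real file (hypothetical content).
SAMPLE = """LABEL,TEXT,URL,EMAIL,PHONE
ham,See you at lunch tomorrow,No,No,No
spam,Big sale this weekend only,No,No,No
smishing,Your account is locked. Verify at http://example.test,Yes,No,No
"""

counts = summarize(io.StringIO(SAMPLE))
print(dict(counts))
print(is_balanced(counts))
```

On the real file, pass an open file handle in place of the `io.StringIO` sample. If the description is accurate, each of the three labels should count 3,397 rows, and 3 × 3,397 = 10,191 matches the stated total.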

Created:

2025-07-04
