A Balanced Dataset for Spam and Smishing Detection using Large Language Models (LLMs)
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://data.mendeley.com/datasets/vmg875v4xs
Description:
This dataset contains 10,191 labeled SMS messages for training and testing spam and smishing detection machine learning models. A large language model (LLM) was trained to create this dataset.
Structure
This dataset contains five columns:
• LABEL: A categorical value indicating the type of message. The values are:
o Ham: Benign (non-malicious) message
o Spam: Unsolicited or junk message
o Smishing: SMS phishing message to deceive recipients into giving away their sensitive personal information
• TEXT: The content of the message
• URL: Indicates whether a URL is present in the message (Yes/No)
• EMAIL: Indicates whether an email address is present in the message (Yes/No)
• PHONE: Indicates whether a phone number is present in the message (Yes/No)
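As an illustration of the column layout above, the following is a minimal sketch of a loader that checks the five columns and converts the Yes/No flags to booleans. It assumes the data is distributed as a CSV file with a header row, which the description does not confirm. The function name `load_rows`, the lower-casing of labels, and the three sample rows are illustrative and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["LABEL", "TEXT", "URL", "EMAIL", "PHONE"]
FLAG_COLUMNS = ("URL", "EMAIL", "PHONE")
# Assumption: labels are compared case-insensitively, since the description
# writes them both capitalized and in lower case.
VALID_LABELS = {"ham", "spam", "smishing"}


def load_rows(handle):
    """Read CSV records from a text handle and validate them against the schema."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for record_no, raw in enumerate(reader, start=1):
        label = raw["LABEL"].strip().lower()
        if label not in VALID_LABELS:
            raise ValueError(f"record {record_no}: unexpected label {raw['LABEL']!r}")
        row = {"label": label, "text": raw["TEXT"]}
        for column in FLAG_COLUMNS:
            value = raw[column].strip().lower()
            if value not in ("yes", "no"):
                raise ValueError(f"record {record_no}: {column} must be Yes or No")
            row[column.lower()] = value == "yes"
        rows.append(row)
    return rows


# Made-up sample rows for demonstration only; they do not come from the dataset.
SAMPLE = """LABEL,TEXT,URL,EMAIL,PHONE
Ham,See you at 6 tonight,No,No,No
Smishing,Your account is locked. Verify at http://example.test,Yes,No,No
Spam,Call 555-0100 for a free quote,No,No,Yes
"""

rows = load_rows(io.StringIO(SAMPLE))
```

To read a downloaded file instead, pass an open file handle (opened with `newline=""`, as the `csv` module recommends) in place of the `io.StringIO` object.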
Key Features
The dataset is balanced to prevent bias in classification tasks:
• ham: 3,397 messages
• spam: 3,397 messages
• smishing: 3,397 messages
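The three class sizes sum to the stated total: 3 × 3,397 = 10,191. A minimal sketch of how the balance could be verified and preserved in a train/test split is shown below. The helper names `class_counts` and `stratified_split`, the 20% test fraction, and the toy label list are assumptions for illustration. The dataset description does not prescribe a split.

```python
import random
from collections import Counter, defaultdict


def class_counts(labels):
    """Count messages per class, ignoring case and surrounding whitespace."""
    return Counter(label.strip().lower() for label in labels)


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with the same test share per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label.strip().lower()].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test


# Toy labels standing in for the LABEL column (10 per class instead of 3,397).
labels = ["ham"] * 10 + ["spam"] * 10 + ["smishing"] * 10
counts = class_counts(labels)
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps both partitions balanced, so accuracy on the test partition is not skewed toward any one label.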
Source and Citation
The LLM was trained on the following publicly available dataset:
Mishra, Sandhya; Soni, Devpriya (2022), “SMS PHISHING DATASET FOR MACHINE LEARNING AND PATTERN RECOGNITION”, Mendeley Data, V1, doi: 10.17632/f45bkkt8pr.1
Use Cases
• Text classification research
• Phishing and fraud detection models
• LLM fine-tuning or prompt engineering for safety and content moderation
• Educational demonstrations in cybersecurity, machine learning (ML) or natural language processing (NLP)
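For the prompt engineering use case, one possible starting point is a zero-shot classification prompt built from the three label definitions given in the Structure section. The sketch below only constructs the prompt and parses a reply. It does not call any model, and the prompt wording and the function names `build_prompt` and `parse_label` are hypothetical rather than part of the dataset.

```python
LABELS = ("ham", "spam", "smishing")


def build_prompt(message):
    """Build a zero-shot prompt asking for exactly one of the three labels."""
    return (
        "Classify the SMS message as exactly one of: ham, spam, smishing.\n"
        "ham: benign (non-malicious) message\n"
        "spam: unsolicited or junk message\n"
        "smishing: SMS phishing message that tries to obtain sensitive "
        "personal information\n"
        f"Message: {message}\n"
        "Label:"
    )


def parse_label(reply):
    """Map a model reply to a known label, or None if it cannot be recognized."""
    words = reply.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    return first if first in LABELS else None


prompt = build_prompt("Your parcel is held. Confirm your card details now.")
```

Returning `None` for unrecognized replies lets an evaluation loop count them separately instead of silently assigning a class.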
Created:
2025-07-04



