
A Balanced Dataset for Spam and Smishing Detection using Large Language Models (LLMs)

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://data.mendeley.com/datasets/vmg875v4xs
Description:
This dataset contains 10,191 labeled SMS messages for training and testing machine learning models that detect spam and smishing. A large language model (LLM) was trained to create this dataset.

Structure
This dataset contains five columns:
• LABEL: A categorical value indicating the type of message. The values are:
  o Ham: Benign (non-malicious) message
  o Spam: Unsolicited or junk message
  o Smishing: SMS phishing message intended to deceive recipients into giving away sensitive personal information
• TEXT: The content of the message
• URL: Indicates whether a URL is present in the message (Yes/No)
• EMAIL: Indicates whether an email address is present in the message (Yes/No)
• PHONE: Indicates whether a phone number is present in the message (Yes/No)

Key Features
The dataset is balanced to prevent bias in classification tasks:
• ham: 3,397 messages
• spam: 3,397 messages
• smishing: 3,397 messages

Source and Citation
The following publicly available dataset was used to train the LLM:
Mishra, Sandhya; Soni, Devpriya (2022), “SMS PHISHING DATASET FOR MACHINE LEARNING AND PATTERN RECOGNITION”, Mendeley Data, V1, doi: 10.17632/f45bkkt8pr.1

Use Cases
• Text classification research
• Phishing and fraud detection models
• LLM fine-tuning or prompt engineering for safety and content moderation
• Educational demonstrations in cybersecurity, machine learning (ML) or natural language processing (NLP)
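The five-column layout and the balanced class counts described above can be sanity-checked before any model training. The following is a minimal sketch in Python, assuming the data is distributed as a CSV file whose header matches the column names listed above; the file format, the sample rows, and the helper names (`load_rows`, `label_counts`, `is_balanced`) are illustrative assumptions, not part of the dataset's documentation. Labels are lowercased before counting because the description writes them both capitalized and in lowercase.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows that follow the five-column layout described above.
# The real file name, format, and label casing are not stated in the source.
SAMPLE_CSV = """LABEL,TEXT,URL,EMAIL,PHONE
ham,See you at lunch tomorrow,No,No,No
spam,Big sale this weekend only,No,No,No
smishing,Your account is locked. Verify at http://example.com,Yes,No,No
Ham,Running late be there soon,No,No,No
Spam,Claim your free ringtone by calling 555-0100,No,No,Yes
Smishing,Confirm your PIN by email to help@example.com,No,Yes,No
"""

EXPECTED_COLUMNS = ["LABEL", "TEXT", "URL", "EMAIL", "PHONE"]


def load_rows(handle):
    """Read rows from a CSV handle and verify the expected header."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


def label_counts(rows):
    """Count messages per class, ignoring label casing and stray spaces."""
    return Counter(row["LABEL"].strip().lower() for row in rows)


def is_balanced(counts):
    """Return True when every class has the same number of messages."""
    return len(set(counts.values())) == 1


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = label_counts(rows)
print(dict(counts))
print("balanced:", is_balanced(counts))
```

To run the same check on the downloaded data, the `io.StringIO(SAMPLE_CSV)` argument could be swapped for an opened file, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the file was saved. If the description is accurate, each of the three classes should then report 3,397 messages.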
Created:
2025-07-04