tanaos/synthetic-spam-detection-dataset-v1

Source: Hugging Face. Updated 2025-12-21; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/tanaos/synthetic-spam-detection-dataset-v1
Description:
This dataset was created synthetically by Tanaos with the Artifex Python library. It is designed for training and evaluating spam detection systems: models that detect, classify, or filter unsolicited commercial advertising, fraudulent messages, or other unwanted text content. Each text sample is labeled either `0` (`not_spam`) or `1` (`spam`). The following categories are considered spam: unsolicited commercial advertising or non-commercial proselytizing, fraudulent schemes, phishing attempts, deceptive or misleading content, malware or harmful links, adult or explicit material, and excessive use of capitalization or punctuation to grab attention. Common use cases include training machine learning models to classify text messages as spam or not spam, evaluating the performance of spam detection algorithms, and fine-tuning pre-trained language models for spam detection tasks.
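To make the label convention and the evaluation use case concrete, here is a minimal sketch of scoring a spam detector against `0`/`1` labels. Everything beyond the label mapping is an assumption for illustration: `shouty_baseline` is a hypothetical heuristic loosely based on the "excessive capitalization or punctuation" category, `evaluate` is a hypothetical helper, and the four samples are made up, not taken from the dataset.

```python
from typing import Callable, Dict, Iterable, Tuple

# Label convention stated in the dataset description.
ID2LABEL = {0: "not_spam", 1: "spam"}


def shouty_baseline(text: str) -> int:
    """Hypothetical heuristic: flag mostly-uppercase text or text with many '!'."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return 1 if upper_ratio > 0.6 or text.count("!") >= 3 else 0


def evaluate(
    samples: Iterable[Tuple[str, int]],
    predict: Callable[[str], int],
) -> Dict[str, float]:
    """Compute precision, recall, F1 (for the spam class) and accuracy."""
    tp = fp = fn = tn = 0
    for text, label in samples:
        pred = predict(text)
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up examples for illustration only; they are NOT taken from the dataset.
TOY_SAMPLES = [
    ("WIN A FREE PHONE NOW!!!", 1),
    ("Your account is locked, verify your password at this link", 1),
    ("Are we still meeting for lunch tomorrow?", 0),
    ("Thanks, I got the report.", 0),
]

metrics = evaluate(TOY_SAMPLES, shouty_baseline)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On these toy samples the heuristic catches the all-caps message but misses the phishing-style one, which illustrates why a surface rule covers only one of the listed spam categories and why a trained or fine-tuned model is the intended use. With the real data, the `(text, label)` pairs would come from the dataset itself; its column names are not given in this listing.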
Provider:
tanaos