SMS Spam Collection Data Set
Source: OpenDataLab. Updated: 2026-05-17. Indexed: 2024-05-09.
Download link:
https://opendatalab.org.cn/OpenDataLab/SMS_Spam_Collection_Data_Set
Resource description:
This corpus was collected from sources on the Internet that are free or freely available for research:
1. A total of 425 SMS spam messages were manually extracted from the Grumbletext website. Grumbletext is a UK-based forum where mobile phone users post public claims about SMS spam they have received, though most posts do not quote the spam message itself. Identifying the spam text within these claims is therefore an arduous and time-consuming task that requires carefully scanning hundreds of web pages.
2. A random subset of 3,375 SMS ham messages was selected from the National University of Singapore SMS Corpus (NSC). The NSC is a research dataset of approximately 10,000 legitimate SMS messages collected by the Department of Computer Science at the National University of Singapore. Most of the messages come from Singaporeans, predominantly students attending the university. They were gathered from volunteers who were informed that their contributions would be made publicly available.
3. A list of 450 SMS ham messages was collected from Caroline Tagg's doctoral dissertation.
4. The SMS Spam Corpus v.0.1 Big contributes 1,002 SMS ham messages and 322 spam messages.
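The four source counts above imply the overall class balance of the corpus. The following is a minimal sketch of that arithmetic. The dictionary keys are illustrative labels only, and the NSC subset is counted as ham on the assumption that it inherits the "legitimate" status of the corpus it was drawn from.

```python
# Message counts per source, taken from the numbered list above.
# Assumption: the NSC subset is all ham, since the NSC is described
# as a corpus of legitimate messages.
sources = {
    "Grumbletext": {"spam": 425, "ham": 0},
    "NSC subset": {"spam": 0, "ham": 3375},
    "Doctoral dissertation list": {"spam": 0, "ham": 450},
    "SMS Spam Corpus v.0.1 Big": {"spam": 322, "ham": 1002},
}


def class_totals(per_source):
    """Sum the spam and ham counts across all sources."""
    totals = {"spam": 0, "ham": 0}
    for counts in per_source.values():
        for label, n in counts.items():
            totals[label] += n
    return totals


totals = class_totals(sources)
total_messages = sum(totals.values())
spam_share = totals["spam"] / total_messages

print(totals)                 # {'spam': 747, 'ham': 4827}
print(total_messages)         # 5574
print(f"{spam_share:.1%}")    # 13.4%
```

Under these counts the corpus is imbalanced at roughly 13% spam, so plain accuracy can look good for a classifier that rarely predicts spam. Per-class precision and recall are usually more informative for data like this.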
Provider:
OpenDataLab
Created:
2022-05-23
Curated Summary
Dataset Introduction

Background and Challenges
Background Overview
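The SMS Spam Collection Data Set is a text classification dataset for SMS spam detection. It contains SMS messages collected from multiple sources and mixes legitimate (ham) and spam messages. The dataset was published by the University of California, Irvine in 2011 and is suitable for text pre-training and classification tasks.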
The summary above was collected and generated by 遇见数据集.
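Since the summary describes the corpus as a text classification dataset, a first step in using it is to read labeled messages into (label, text) pairs. The sketch below assumes a layout of one message per line in the form `label<TAB>text`, with labels `ham` and `spam`. That layout is not stated on this page, so check it against the downloaded file. The function name `parse_sms_lines` is hypothetical, and the sample messages are placeholders invented for illustration rather than taken from the corpus.

```python
import io
from collections import Counter


def parse_sms_lines(lines):
    """Yield (label, text) pairs from lines of the assumed form 'label<TAB>text'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, sep, text = line.partition("\t")
        if not sep or label not in {"ham", "spam"}:
            raise ValueError(f"unexpected line: {line!r}")
        yield label, text


# Placeholder messages invented for this sketch; they are NOT from the corpus.
sample = io.StringIO(
    "ham\tplaceholder legitimate message one\n"
    "spam\tplaceholder spam message\n"
    "ham\tplaceholder legitimate message two\n"
)

records = list(parse_sms_lines(sample))
label_counts = Counter(label for label, _ in records)
print(label_counts)
```

To run this on real data, replace the `io.StringIO` sample with an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to wherever the downloaded file is stored.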



