SMS Spam Collection dataset: a collection of SMS messages labeled as spam or legitimate (ham)
帕依提提 (Payititi), indexed 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-1890.html
Description:
SMS Spam Corpus v.0.1 is a set of labeled SMS messages collected for SMS spam research. It contains two collections of English SMS messages, of 1,084 and 1,319 messages, labeled as legitimate or spam.

The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam. The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

This corpus has been collected from sources on the Internet that are free or free for research:
-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users post public complaints about SMS spam messages, most of them without quoting the spam message they received. Identifying the text of spam messages in these complaints is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web link].
-> A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web link].
-> A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis, available at [Web link].
-> Finally, the SMS Spam Corpus v.0.1 Big has been incorporated. It has 1,002 SMS ham messages and 322 spam messages and is publicly available at: [Web link].

This corpus has been used in academic research.
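The two-column layout and the per-source counts described above can be sketched in Python. This is a minimal illustration under stated assumptions, not an official loader for the dataset: the comma-delimited layout with a v1,v2 header row, the invented sample messages, and the helper names load_messages, count_labels, and expected_totals are all hypothetical. Only the per-source counts are taken from the description.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the described layout (v1 = label, v2 = raw text).
# The header row and comma delimiter are assumptions; the messages are invented.
SAMPLE = '''v1,v2
ham,"Ok, see you at 5"
spam,"WINNER! Claim your free prize now"
ham,Running late
'''

# Per-source message counts as stated in the description.
SOURCES = {
    "Grumbletext": {"spam": 425},
    "NUS SMS Corpus subset": {"ham": 3375},
    "Caroline Tag's PhD Thesis": {"ham": 450},
    "SMS Spam Corpus v.0.1 Big": {"ham": 1002, "spam": 322},
}


def load_messages(fileobj):
    """Read (label, text) pairs from a stream with v1 and v2 columns."""
    messages = []
    for row in csv.DictReader(fileobj):
        label = row["v1"].strip().lower()
        if label not in {"ham", "spam"}:
            raise ValueError(f"unexpected label: {label!r}")
        messages.append((label, row["v2"]))
    return messages


def count_labels(messages):
    """Count how many messages carry each label."""
    return Counter(label for label, _ in messages)


def expected_totals(sources):
    """Sum the per-source counts into overall ham and spam totals."""
    totals = Counter()
    for counts in sources.values():
        totals.update(counts)
    return totals


if __name__ == "__main__":
    messages = load_messages(io.StringIO(SAMPLE))
    print(count_labels(messages))

    totals = expected_totals(SOURCES)
    print(totals["ham"], totals["spam"], sum(totals.values()))
```

Summing the stated sources gives 4,827 ham and 747 spam messages, which agrees with the stated total of 5,574. To read a real copy of the file, the in-memory sample would be replaced by an opened file object; the file name and encoding depend on the copy downloaded and are not specified in the description.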
Provider:
帕依提提 (Payititi)



