UCI Spambase
Listed by 帕依提提 on 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-257.html
Description:
The "spam" concept is diverse: advertisements for products or web sites, make-money-fast schemes, chain letters, pornography, and so on. Our collection of spam e-mails came from our postmaster and from individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These indicators are useful when constructing a personalized spam filter. To build a general-purpose spam filter, one would either have to blind such non-spam indicators or gather a very wide collection of non-spam.
For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998.
(a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
(b) Determine whether a given e-mail is spam or not.
(c) ~7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passes through the filter.
Provider:
帕依提提
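
The description above notes that insisting on zero false positives lets a share of spam through the filter. The following is a minimal sketch of that trade-off, assuming a hypothetical classifier that assigns each message a spam score and flags anything above a threshold. The scores, labels, and function names are invented for illustration; they are not taken from the Spambase data and do not reproduce the method behind the reported figures.

```python
from typing import Sequence

# Hypothetical classifier scores (higher = more spam-like); 1 = spam, 0 = non-spam.
# These values are made up for illustration only.
SCORES = [0.05, 0.10, 0.30, 0.62, 0.55, 0.70, 0.81, 0.93, 0.99]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1, 1]


def zero_fp_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the highest score among non-spam messages.

    Flagging only messages that score strictly above this value gives
    zero false positives on the supplied data.
    """
    ham_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not ham_scores:
        raise ValueError("need at least one non-spam example")
    return max(ham_scores)


def false_positives(scores: Sequence[float], labels: Sequence[int],
                    threshold: float) -> int:
    """Count non-spam messages that score strictly above the threshold."""
    return sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)


def spam_pass_rate(scores: Sequence[float], labels: Sequence[int],
                   threshold: float) -> float:
    """Fraction of spam messages that are not flagged (score <= threshold)."""
    spam_scores = [s for s, y in zip(scores, labels) if y == 1]
    if not spam_scores:
        raise ValueError("need at least one spam example")
    missed = sum(1 for s in spam_scores if s <= threshold)
    return missed / len(spam_scores)


if __name__ == "__main__":
    for name, t in [("default 0.5", 0.5),
                    ("zero-FP", zero_fp_threshold(SCORES, LABELS))]:
        print(f"{name}: threshold={t:.2f}, "
              f"false positives={false_positives(SCORES, LABELS, t)}, "
              f"spam passed={spam_pass_rate(SCORES, LABELS, t):.0%}")
```

On this toy data, raising the threshold from 0.5 to the zero-false-positive value removes the one misflagged non-spam message at the cost of letting one of the five spam messages through. A threshold chosen this way is only guaranteed to give zero false positives on the data it was tuned on.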



