
qgyd2021/spam_detect

Hugging Face, updated 2023-12-05, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/qgyd2021/spam_detect
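Resource summary:
This dataset comprises multiple sub-datasets for spam detection, covering spam email classification, spam SMS classification, and advertisement identification. For each sub-dataset it records the data source, language, task type, original data/project URL, sample count, a description of the original data, and alternative download sources. The sample examples illustrate the content of different kinds of spam and their labels.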

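Provider:
qgyd2021
Summary of original information

Spam detection dataset summary

Dataset overview

The dataset targets spam detection, including spam email, spam SMS, and advertisement identification. It contains samples in several languages and task types, predominantly English and Chinese.

Data sources

The dataset was collected and consolidated from multiple public resources, listed below: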

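| Dataset name | Language | Task type | Original data / project URL | Sample count | Original data description | Alternative download source |
| --- | --- | --- | --- | --- | --- | --- |
| enron_spam | English | Spam email classification | enron_spam_data; Enron-Spam; spam-mails-dataset | ham: 16545; spam: 17171 | The Enron-Spam dataset is an excellent resource collected by V. Metsis, I. Androutsopoulos, and G. Paliouras. | SetFit/enron_spam; enron-spam |
| enron_spam_subset | English | Spam email classification | email-spam-dataset | ham: 5000; spam: 5000 | | |
| ling_spam | English | Spam email classification | lingspam-dataset; email-spam-dataset | ham: 2172; spam: 433 | The Ling-Spam dataset is a collection of 2,893 spam and non-spam messages curated from the Linguist List. | |
| sms_spam | English | Spam SMS classification | SMS Spam Collection; SMS Spam Collection Dataset | ham: 4827; spam: 747 | The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research. | sms_spam |
| sms_spam_collection | English | Spam SMS classification | spam-emails | ham: 4825; spam: 747 | This dataset contains a collection of emails. | email-spam-detection-dataset-classification; spam-identification; sms-spam-collection; spam-or-ham |
| spam_assassin | English | Spam email classification | datasets-spam-assassin; Apache SpamAssassin’s public datasets; Spam or Not Spam Dataset | ham: 4150; spam: 1896 | Derived from the completeSpamAssassin.csv file in email-spam-dataset. | email-spam-dataset; talby/SpamAssassin; spamassassin-2002 |
| spam_base | English | Spam email classification | spambase | | Classifies emails as spam or non-spam. | spam-email-data-uci |
| spam_detection | English | Spam SMS classification | Deysi/spam-detection-dataset | ham: 5400; spam: 5500 | | |
| spam_message | Chinese | Spam SMS classification | SpamMessage | ham: 720000; spam: 80000 | The spam samples are genuine data but have been anonymized (for example 招生电话:xxxxxxxxxxx, "admissions hotline: xxxxxxxxxxx"), so the x characters may become a salient feature. The ham samples look like fragments cut from ordinary text to serve as samples. Using this data is not recommended. | |
| spam_message_lr | Chinese | Spam SMS classification | SpamMessagesLR | ham: 3983; spam: 6990 | | |
| trec07p | English | Spam email classification | 2007 TREC Public Spam Corpus; Spam Track | ham: 25220; spam: 50199 | 2007 TREC Public Spam Corpus | trec07p.tar.gz |
| trec06c | Chinese | Spam email classification | 2006 TREC Public Spam Corpora | | 2006 TREC Public Spam Corpora | |
| youtube_spam_collection | English | Spam comment classification | youtube+spam+collection; YouTube Spam Collection Data Set | ham: 951; spam: 1005 | A public set of comments collected for spam research. | |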
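
The spam_message row warns that the anonymization mask (runs of x) may become a salient feature. One possible mitigation, sketched here as an assumption rather than anything the dataset authors prescribe, is to map both mask runs and real digit runs to a single placeholder before feature extraction, so that the mask alone cannot separate the classes. The token name `<NUM>`, the run-length threshold of 3, and the unmasked example number are hypothetical.

```python
import re

# Hypothetical placeholder token; it is not part of the dataset.
NUM_TOKEN = "<NUM>"

# A run of 3+ ASCII digits, or a run of 3+ 'x'/'X' mask characters.
# The threshold of 3 is an assumption, chosen so that ordinary words
# containing a single 'x' (e.g. "taxi") are left alone.
_MASK_OR_DIGITS = re.compile(r"[0-9]{3,}|[xX]{3,}")


def normalize_numbers(text: str) -> str:
    """Replace digit runs and x-mask runs with one shared placeholder."""
    return _MASK_OR_DIGITS.sub(NUM_TOKEN, text)


# The masked string is quoted from the table; the unmasked one is made up.
masked = "招生电话:xxxxxxxxxxx"
unmasked = "招生电话:13800000000"
print(normalize_numbers(masked))
print(normalize_numbers(unmasked))
```

After normalization the masked and unmasked strings are identical, which is the point of the step. It does not address the separate concern about how the ham samples were produced.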

Sample examples

enron_spam sample examples

wanted to try ci 4 lis but thought it was way too expensive for you ? <br> viagra at $ 1 . 12 per dose ready to boost your sex life ? positive ? time to do it right now . order viagra at incredibly low prices $ 1 . 12 per dose . unbelivable remove <br> spam

enron / hpl actuals for december 11 , 2000 <br> teco tap 30 . 000 / enron ; 120 . 000 / hpl gas daily ls hpl lsk ic 30 . 000 / enron ham

looking for cheap high - quality software ? rotated napoleonizes <br> water past also , burn , course . gave country , mass lot . act north good . from , learn form most brother vary . when more for . up stick , century put , song be . test , describe , plain , against wood star . began dress ever group . here oh , most world stay . <br> spam

ideabank website <br> please read the attached document for information about an exciting new website for ets employees ! ham
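
In the listing above, each sample is flattened onto one line: the message text comes first, `<br>` appears where a line break presumably stood, and the class label (spam or ham) is the final token. A minimal sketch for splitting such a line back into text and label follows. The function name `split_sample` is hypothetical, and reading `<br>` as a line break is an assumption about the page's formatting.

```python
from typing import Optional, Tuple

LABELS = ("ham", "spam")


def split_sample(line: str) -> Tuple[str, Optional[str]]:
    """Split a flattened sample line into (text, label).

    The label is the last space-separated token if it is 'ham' or 'spam';
    otherwise the label is None (for example, a truncated sample).
    '<br>' markers are converted to newlines.
    """
    stripped = line.strip()
    head, _, tail = stripped.rpartition(" ")
    if tail in LABELS:
        text, label = head, tail
    else:
        text, label = stripped, None
    text = text.replace("<br>", "\n")
    # Trim the spaces left around each inserted newline, then the ends.
    text = "\n".join(part.strip() for part in text.split("\n")).strip()
    return text, label


line = ("ideabank website <br> please read the attached document for "
        "information about an exciting new website for ets employees ! ham")
text, label = split_sample(line)
print(label)
print(text)
```

A message whose own last word happens to be "ham" or "spam" would be mislabeled by this heuristic, so it is only suitable for reading the examples on this page, not for parsing the underlying data files.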

enron_spam_subset sample examples

Subject: edrugs online viagra - proven step to start something all over again . nothing is more useful than silence . teachers open the door . you enter by yourself . how sharper than a serpent s tooth it isto have a thankless child ! spam

Subject: start date : 12 / 13 / 01 ; hourahead hour : 5 ; start date : 12 / 13 / 01 ; hourahead hour : 5 ; no ancillary schedules awarded . no variances detected . log messages : parsing file - - > > o : portland westdesk california scheduling iso final schedules 2001121305 . txt ham

Subject: cheapestt medss ! mediccationns at lowesst pricess everyy ! over 80 . % offf , pricess wontt get lowerr we selll vic od ( in v , ia . gra x , ana . x http : / / www . pr 3 sdlugs . com / ? refid = 87 <br> spam

Subject: fw : picture

the following is an aerial photo of the wtc area . it kinda brings on vertigo , but is a phenomenal shot .

http : / / userwww . service . emory . edu / ~ rdgarr / wtc . htm ham


ling_spam sample examples

Subject: internet specialist 007 - the spy <br> internet specialist 007 - the spy learn everything about your friends , neighbors , enemies , employees or anyone else ! - - even your boss ! - - even yourself ! this mammoth snoop collection of internet sites will provide you the newest and most current addresses available on the net today . . . = 20 * track down an old debt , or anyone else that has done you wrong ! it s incredible , and so many new data sites have come online in the past 90 days . . . * over 300 giant resources to look up people , credit , social security , current or past employment , mail order purchases , = 20 addresses , phone numbers , maps to city locations . . . * investigate your family history ! check birth , death , adoption or social security records check service records or army , navy , air force or = 20 marine corps . * locate an old friend ( or an enemy who is hiding ) or a lost = 20 love - - find e-mail , telephone or address information on anyone ! = 20 even look up * unlisted * phone numbers ! * find work by searching classified ads all over the world ! * screen prospective employees - - check credit , driving or criminal records verify income or educational accomplishments = 20 * check out your daughter s new boyfriend ! * find trial transcripts and court orders ! * enjoy the enchantment of finding out a juicy tid-bit about a co-worker . the internet is a powerful megasource of information , = 20 if you only know where to look . i tell you how to find = 20 out nearly anything about anybody , and tell you exactly where to find it ! you will be amazed to find out what personal information = 20 other people can find out about you ! check your credit = 20 report so you can correct wrong information that may be = 20 used to deny you credit . research yourself first ! you ll be horrified , as i was , = 20 at how much data has been accumulated about you . any my huge collection is only the beginning ! 
once you = 20 locate these free private , college and government web sites , you ll find even more links to even more = 20 information search engines ! = 20 if you believe ( like i do ) that the information that is stored about each one of us should be freely accessible , you ll want to see the snoop collection i ve compiled . verify your own records , or find out what you need to = 20 know about others . i m telling you , it s incredible what you can find out using the internet ! we will accept checks by fax at 813-269 - 9651 or > > > send $ 14 . 95 cash , check or money order to : > > > the coldwell group > > > p . o . box 3787 > > > dept 1007 > > > petersburg , va 23805 i will rush back to you my snoop information for fastest service include your * e-mail * address . = 20 * what information is available - - and exact url to get there ! * exactly where to look for - - and the clever way to use - - = 20 the above search engines , and tons more ! * my easy-to - browse categorized megacenter of information has my own description of how to use each site , and what you ll find when you get there - - and tricky tips on how to = 20 extract the best data ! you can know everything about everybody with this internet specialist collection ! * * soon to be available - - the most complete international internet spy = 20 sites available on the web today * * don t miss this one or you ll be sorry = 20 to be removed from our list please fax your address to 813-269 - 9651 . l = e3 = 01 @ u = 0b <br> spam

Subject: usage - based models - symposium <br> announcing the sixth biennial symposium of the rice university department of linguistics usage-based models of language rice university march 15-18 , 1995 invited speakers : mira ariel tel aviv university joan bybee university of new mexico john du bois university of california , santa barbara michael israel university of california , san diego sydney lamb rice university ronald langacker university of california , san diego tom givon university of oregon brian macwhinney carnegie - mellon university janet pierrehumbert northwestern university john sinclair university of birmingham ( u . k . ) arie verhagen university of utrecht description : the goal of this symposium is to explore approaches to linguistic theory that have in common the aim of accounting for linguistic usage . the empirical data for such theories is not restricted to linguistic intuitions about acceptibility , but comes from usage events of varied types . the focus is on the patterns found in the various sorts of usage data examined , and how those patterns can be extracted , represented , and used by the human mind . research from a variety of traditions will be represented , including corpus-based analyses , discourse studies , experimental studies of language processing and language acquisition , and instrumental phonetics . the approaches taken can be called data-driven , rather than model-driven , in that the fewest possible prior assumptions are made about what types of data are relevant , and that large sets of usage events are observed so that the detailed patterns found in actual usage can emerge . moreover , the various approaches taken show signs of converging toward a view of language as a dynamic system in which linguistic knowledge is not separate from its processing in language use . 
the linguistic models representing this view are usage-based by virtue of three factors : ( 1 ) the importance placed on usage data for theory construction ; ( 2 ) the direct incorporation of processing ( production and comprehension ) into linguistic theory ; and ( 3 ) the requirement that the models arrived at , whatever the direct source of evidence , must be testable with reference to language use . registration : no charge . symposium attendance on a space-available basis . for further information , contact suzanne kemmer ( kemmer @ ruf . rice . edu ) or michael barlow ( barlow @ ruf . rice . edu ) snailmail : dept . of linguistics , rice university , houston tx 77251-1892 . <br> ham

Subject: domani <br> new improved with free software , free bulk e mail system , free web site = to do what you wish , ongoing support ( optional ) , and a lot more ! all = included . . . . . . . . . . . this is a one time mailing . . . . . . . . . . . . . . . $ = you are about to make at least $ 50 , 000 in less than 90 days read the enclosed program . . . then read it again . . . / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /

Collected summary
Dataset introduction
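Construction
The dataset is assembled from publicly collected spam email, spam SMS, and advertisement data. It covers multiple languages, including English and Chinese. The data comes mainly from online sources and includes emails, SMS messages, and comments. To facilitate research, the samples are organized into two classes: spam and non-spam (ham).
Features
The dataset is diverse in data type, spanning emails, SMS messages, and comments, and it covers multiple languages, including English and Chinese. It is fairly large: the per-subset counts in the data-source table sum to roughly 960,000 labeled samples across the subsets that report counts. The samples are already sorted into classes, which makes the dataset convenient for spam detection research.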
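
To make the size and class balance concrete, the following sketch totals the ham and spam counts listed in the data-source table and computes each subset's spam ratio. The counts are copied from that table. spam_base and trec06c are left out because the table gives no counts for them.

```python
# Sample counts copied from the data-source table.
# spam_base and trec06c are omitted because no counts are listed for them.
COUNTS = {
    "enron_spam": {"ham": 16545, "spam": 17171},
    "enron_spam_subset": {"ham": 5000, "spam": 5000},
    "ling_spam": {"ham": 2172, "spam": 433},
    "sms_spam": {"ham": 4827, "spam": 747},
    "sms_spam_collection": {"ham": 4825, "spam": 747},
    "spam_assassin": {"ham": 4150, "spam": 1896},
    "spam_detection": {"ham": 5400, "spam": 5500},
    "spam_message": {"ham": 720000, "spam": 80000},
    "spam_message_lr": {"ham": 3983, "spam": 6990},
    "trec07p": {"ham": 25220, "spam": 50199},
    "youtube_spam_collection": {"ham": 951, "spam": 1005},
}


def spam_ratio(name: str) -> float:
    """Fraction of a subset's samples that are labeled spam."""
    c = COUNTS[name]
    return c["spam"] / (c["ham"] + c["spam"])


total_ham = sum(c["ham"] for c in COUNTS.values())
total_spam = sum(c["spam"] for c in COUNTS.values())

# Print subsets from least to most spam-heavy.
for name in sorted(COUNTS, key=spam_ratio):
    print(f"{name:26s} spam ratio = {spam_ratio(name):.3f}")
print("total ham:", total_ham, "total spam:", total_spam)
```

spam_message alone accounts for 800,000 of the counted samples (about 83%), so naively pooling all subsets would be dominated by the one subset the source advises against using. Per-subset evaluation or explicit reweighting is worth considering.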
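Usage
To use the dataset, first become familiar with its structure and characteristics. Samples are already organized into spam and non-spam classes, and researchers can pick the sub-datasets that suit their needs. The sample examples also help in understanding the content and format of the data.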
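
A minimal sketch of selecting one sub-dataset programmatically is given below. It assumes the Hugging Face `datasets` library is installed, that the sub-dataset names from the table (such as `sms_spam`) are exposed as configuration names, and that a `train` split exists. None of this is confirmed by this page. The helper names `dataset_page_url` and `load_subset` are hypothetical. Depending on the installed `datasets` version and how the repository is packaged, extra arguments or a different loading route may be needed.

```python
REPO_ID = "qgyd2021/spam_detect"
MIRROR_ENDPOINT = "https://hf-mirror.com"  # mirror host used by the download link


def dataset_page_url(repo_id: str = REPO_ID, endpoint: str = MIRROR_ENDPOINT) -> str:
    """Build the dataset page URL on a given Hub endpoint."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def load_subset(name: str, split: str = "train"):
    """Load one sub-dataset with the Hugging Face `datasets` library.

    Assumptions: `datasets` is installed, `name` is a valid configuration
    name for this repository, and the requested split exists.
    """
    # Imported lazily so the module can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name=name, split=split)


print(dataset_page_url())
# ds = load_subset("sms_spam")  # uncomment to download (needs network access)
```

Keeping the download inside a function means nothing is fetched until a subset is explicitly requested, which is convenient given the very different subset sizes.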
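Background and challenges
Background overview
With the rapid development of information technology, spam detection has become an important means of safeguarding information security. The qgyd2021/spam_detect dataset focuses on identifying spam emails, spam SMS messages, and advertisements, aiming to improve spam recognition accuracy through machine learning and natural language processing and thereby reduce how often users are disturbed by spam. It covers English and Chinese and draws on several online sources, including the Enron-Spam email dataset (collected by V. Metsis, I. Androutsopoulos, and G. Paliouras), the SMS Spam Collection, and SpamAssassin, providing a valuable resource for spam detection research.
Current challenges
Spam detection with the qgyd2021/spam_detect dataset faces several challenges. First, spam comes in many forms, including spam emails, spam SMS messages, and advertisements, and each type has its own characteristics and surface forms, which makes detection harder. Second, spam is generated and spreads very quickly, so detection models must be continually updated and tuned to keep up with a changing environment. Finally, detection models must trade off accuracy against efficiency: improving detection speed while maintaining accuracy remains an open problem in the field.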
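Common scenarios
Classic use case
In the information age, spam detection is a critical task that aims to identify and filter spam in emails, SMS messages, and social media comments. The qgyd2021/spam_detect dataset provides rich resources for this task, with spam samples in multiple languages and from multiple platforms. Its classic use case is training machine learning models to identify spam automatically, protecting users from unwanted interruptions and potential security threats.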
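
As an illustration of the classic use case, here is a small multinomial naive Bayes classifier with add-one smoothing, written in pure Python. It is a teaching sketch, not a method this page attributes to the dataset authors. The whitespace tokenizer is a deliberately simple assumption, and the four training messages are invented toy data, not samples from the dataset.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.doc_counts: Counter = Counter()            # documents per label
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def fit(self, samples: Iterable[Tuple[str, str]]) -> "NaiveBayesSpamFilter":
        for text, label in samples:
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def log_scores(self, text: str) -> Dict[str, float]:
        """Unnormalized log posterior for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores: Dict[str, float] = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)       # log prior
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue                             # skip unseen words
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text: str) -> str:
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy messages for illustration only (not taken from the dataset).
train = [
    ("cheap pills order now", "spam"),
    ("win cash prize now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("please review the attached schedule", "ham"),
]
clf = NaiveBayesSpamFilter().fit(train)
print(clf.predict("order cheap prize"))
print(clf.predict("attached schedule for review"))
```

For real experiments, the toy list would be replaced by (text, label) pairs from one of the sub-datasets, with a held-out portion kept for evaluation.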
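Derived related work
Related work associated with the qgyd2021/spam_detect dataset includes deep-learning-based spam detection models, spam detection algorithms based on multilingual features, and spam detection techniques based on behavior analysis. Such work enriches the theory of spam detection and offers more options for practical applications.
Recent research on the dataset
Latest research directions
In spam detection, and in email and SMS classification in particular, current research concentrates on improving classification accuracy and model generalization. Researchers are exploring more advanced natural language processing techniques, such as deep learning and transfer learning, to better understand and handle spam. Cross-lingual spam detection has also become an active direction, with the goal of building models that identify spam effectively across languages. In addition, because the form and content of spam keep changing, making models adapt to that drift and using external knowledge and contextual information to improve classification are important open topics.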
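
Because several subsets in the data-source table are imbalanced (spam_message, for instance, is 10% spam), plain accuracy can hide a detector that never flags spam. The sketch below computes precision, recall, and F1 with spam as the positive class, alongside accuracy. The function name `binary_metrics` and the example predictions are hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "spam") -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# A made-up 10%-spam sample where the model predicts 'ham' for everything.
y_true = ["ham"] * 9 + ["spam"]
y_pred = ["ham"] * 10
print(binary_metrics(y_true, y_pred))
```

In this example accuracy is 0.9 while recall on spam is 0.0, which is why per-class metrics are the safer way to compare models across subsets with different class ratios.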
The content above was collected and summarized by 遇见数据集.