
qgyd2021/spam_detect | Spam Detection Dataset | Data Analysis Dataset

Source: Hugging Face. Updated 2023-12-05; indexed 2024-03-04.
Tags: spam detection, data analysis
Download link:
https://hf-mirror.com/datasets/qgyd2021/spam_detect
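Summary:
This dataset comprises multiple sub-datasets for spam detection, covering tasks such as email spam classification, SMS spam classification, and advertisement identification. For each sub-dataset it records the data source, language, task type, original data/project address, sample count, a description of the original data, and alternative download addresses. The sample examples show the content and classification labels of different kinds of spam.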

Provider:
qgyd2021
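Summary of original information

Spam Detection Dataset Overview

Dataset overview

This dataset is intended for spam detection, including email spam, SMS spam, and advertisement identification. It contains samples in several languages and for several task types, predominantly in English and Chinese.

Data sources

The dataset was collected and compiled from multiple public resources, listed below: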

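| Dataset | Language | Task type | Original data / project | Sample count | Description of original data | Alternative downloads |
| --- | --- | --- | --- | --- | --- | --- |
| enron_spam | English | Email spam classification | enron_spam_data; Enron-Spam; spam-mails-dataset | ham: 16545; spam: 17171 | The Enron-Spam dataset is an excellent resource collected by V. Metsis, I. Androutsopoulos, and G. Paliouras. | SetFit/enron_spam; enron-spam |
| enron_spam_subset | English | Email spam classification | email-spam-dataset | ham: 5000; spam: 5000 | | |
| ling_spam | English | Email spam classification | lingspam-dataset; email-spam-dataset | ham: 2172; spam: 433 | The Ling-Spam dataset is a collection of 2,893 spam and non-spam messages curated from the Linguist List. | |
| sms_spam | English | SMS spam classification | SMS Spam Collection; SMS Spam Collection Dataset | ham: 4827; spam: 747 | The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research. | sms_spam |
| sms_spam_collection | English | SMS spam classification | spam-emails | ham: 4825; spam: 747 | This dataset contains a collection of emails. | email-spam-detection-dataset-classification; spam-identification; sms-spam-collection; spam-or-ham |
| spam_assassin | English | Email spam classification | datasets-spam-assassin; Apache SpamAssassin's public datasets; Spam or Not Spam Dataset | ham: 4150; spam: 1896 | The data comes from the completeSpamAssassin.csv file of email-spam-dataset. | email-spam-dataset; talby/SpamAssassin; spamassassin-2002 |
| spam_base | English | Email spam classification | spambase | | Classifies emails as spam or non-spam. | spam-email-data-uci |
| spam_detection | English | SMS spam classification | Deysi/spam-detection-dataset | ham: 5400; spam: 5500 | | |
| spam_message | Chinese | SMS spam classification | SpamMessage | ham: 720000; spam: 80000 | The spam samples are genuine but have been desensitized (for example 招生电话:xxxxxxxxxxx, "admissions hotline: xxxxxxxxxxx"), so the x characters may become a salient feature. The ham samples look as if they were cut out of ordinary text to serve as samples. Using this data is not recommended. | |
| spam_message_lr | Chinese | SMS spam classification | SpamMessagesLR | ham: 3983; spam: 6990 | | |
| trec07p | English | Email spam classification | 2007 TREC Public Spam Corpus; Spam Track | ham: 25220; spam: 50199 | 2007 TREC Public Spam Corpus | trec07p.tar.gz |
| trec06c | Chinese | Email spam classification | 2006 TREC Public Spam Corpora | | 2006 TREC Public Spam Corpora | |
| youtube_spam_collection | English | Comment spam classification | youtube+spam+collection; YouTube Spam Collection Data Set | ham: 951; spam: 1005 | A public set of comments collected for spam research. | |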
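
The ham/spam counts listed above differ widely in class balance, which matters when choosing a sub-dataset or weighting a loss. The following is a minimal sketch that copies the counts from the listing (sub-datasets without counts are omitted) and derives the spam ratio and simple inverse-frequency class weights. The names `COUNTS`, `spam_ratio`, and `class_weights` are illustrative and are not part of the dataset or any library.

```python
# Sample counts (ham, spam) copied from the listing above.
# spam_base and trec06c are omitted because no counts are given for them.
COUNTS = {
    "enron_spam": (16545, 17171),
    "enron_spam_subset": (5000, 5000),
    "ling_spam": (2172, 433),
    "sms_spam": (4827, 747),
    "sms_spam_collection": (4825, 747),
    "spam_assassin": (4150, 1896),
    "spam_detection": (5400, 5500),
    "spam_message": (720000, 80000),
    "spam_message_lr": (3983, 6990),
    "trec07p": (25220, 50199),
    "youtube_spam_collection": (951, 1005),
}


def spam_ratio(name):
    """Fraction of samples labeled spam in the named sub-dataset."""
    ham, spam = COUNTS[name]
    return spam / (ham + spam)


def class_weights(name):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    ham, spam = COUNTS[name]
    total = ham + spam
    return {"ham": total / (2 * ham), "spam": total / (2 * spam)}


if __name__ == "__main__":
    # Print sub-datasets from most ham-heavy to most spam-heavy.
    for name in sorted(COUNTS, key=spam_ratio):
        print(f"{name:26s} spam ratio = {spam_ratio(name):.3f}")
```

Weights of this kind are only one common heuristic; resampling or threshold tuning may suit a given model better.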

Sample examples

enron_spam sample examples

wanted to try ci 4 lis but thought it was way too expensive for you ? <br> viagra at $ 1 . 12 per dose ready to boost your sex life ? positive ? time to do it right now . order viagra at incredibly low prices $ 1 . 12 per dose . unbelivable remove <br> spam

enron / hpl actuals for december 11 , 2000 <br> teco tap 30 . 000 / enron ; 120 . 000 / hpl gas daily ls hpl lsk ic 30 . 000 / enron ham

looking for cheap high - quality software ? rotated napoleonizes <br> water past also , burn , course . gave country , mass lot . act north good . from , learn form most brother vary . when more for . up stick , century put , song be . test , describe , plain , against wood star . began dress ever group . here oh , most world stay . <br> spam

ideabank website <br> please read the attached document for information about an exciting new website for ets employees ! ham
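
Each sample on this page is shown as the message text followed by its label (`spam` or `ham`) as the last token, with `<br>` standing in for line breaks. The sketch below splits such a displayed row back into text and label. It assumes this page layout only; the underlying dataset may well store text and label as separate fields, and `parse_sample` is a hypothetical helper, not part of the dataset.

```python
import re

# Text, then at least one separator (whitespace or <br>), then the label as the
# final token. Requiring a separator keeps words such as "graham" from matching.
_ROW = re.compile(r"^(?P<text>.*?)(?:\s|<br>)+(?P<label>spam|ham)\s*$", re.DOTALL)


def parse_sample(row):
    """Return (text, label) for a displayed row, or None if no trailing label."""
    match = _ROW.match(row.strip())
    if match is None:
        return None
    # Turn the <br> placeholders back into real line breaks.
    text = re.sub(r"\s*<br>\s*", "\n", match.group("text")).strip()
    return text, match.group("label")


if __name__ == "__main__":
    row = ("ideabank website <br> please read the attached document for "
           "information about an exciting new website for ets employees ! ham")
    print(parse_sample(row))
```

Rows that span several lines on the page need to be joined into one string before parsing, since the label appears only at the very end of the sample.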

enron_spam_subset sample examples

Subject: edrugs online viagra - proven step to start something all over again . nothing is more useful than silence . teachers open the door . you enter by yourself . how sharper than a serpent s tooth it isto have a thankless child ! spam

Subject: start date : 12 / 13 / 01 ; hourahead hour : 5 ; start date : 12 / 13 / 01 ; hourahead hour : 5 ; no ancillary schedules awarded . no variances detected . log messages : parsing file - - > > o : portland westdesk california scheduling iso final schedules 2001121305 . txt ham

Subject: cheapestt medss ! mediccationns at lowesst pricess everyy ! over 80 . % offf , pricess wontt get lowerr we selll vic od ( in v , ia . gra x , ana . x http : / / www . pr 3 sdlugs . com / ? refid = 87 <br> spam

Subject: fw : picture

the following is an aerial photo of the wtc area . it kinda brings on vertigo , but is a phenomenal shot .

http : / / userwww . service . emory . edu / ~ rdgarr / wtc . htm ham


ling_spam sample examples

Subject: internet specialist 007 - the spy <br> internet specialist 007 - the spy learn everything about your friends , neighbors , enemies , employees or anyone else ! - - even your boss ! - - even yourself ! this mammoth snoop collection of internet sites will provide you the newest and most current addresses available on the net today . . . = 20 * track down an old debt , or anyone else that has done you wrong ! it s incredible , and so many new data sites have come online in the past 90 days . . . * over 300 giant resources to look up people , credit , social security , current or past employment , mail order purchases , = 20 addresses , phone numbers , maps to city locations . . . * investigate your family history ! check birth , death , adoption or social security records check service records or army , navy , air force or = 20 marine corps . * locate an old friend ( or an enemy who is hiding ) or a lost = 20 love - - find e-mail , telephone or address information on anyone ! = 20 even look up * unlisted * phone numbers ! * find work by searching classified ads all over the world ! * screen prospective employees - - check credit , driving or criminal records verify income or educational accomplishments = 20 * check out your daughter s new boyfriend ! * find trial transcripts and court orders ! * enjoy the enchantment of finding out a juicy tid-bit about a co-worker . the internet is a powerful megasource of information , = 20 if you only know where to look . i tell you how to find = 20 out nearly anything about anybody , and tell you exactly where to find it ! you will be amazed to find out what personal information = 20 other people can find out about you ! check your credit = 20 report so you can correct wrong information that may be = 20 used to deny you credit . research yourself first ! you ll be horrified , as i was , = 20 at how much data has been accumulated about you . any my huge collection is only the beginning ! 
once you = 20 locate these free private , college and government web sites , you ll find even more links to even more = 20 information search engines ! = 20 if you believe ( like i do ) that the information that is stored about each one of us should be freely accessible , you ll want to see the snoop collection i ve compiled . verify your own records , or find out what you need to = 20 know about others . i m telling you , it s incredible what you can find out using the internet ! we will accept checks by fax at 813-269 - 9651 or > > > send $ 14 . 95 cash , check or money order to : > > > the coldwell group > > > p . o . box 3787 > > > dept 1007 > > > petersburg , va 23805 i will rush back to you my snoop information for fastest service include your * e-mail * address . = 20 * what information is available - - and exact url to get there ! * exactly where to look for - - and the clever way to use - - = 20 the above search engines , and tons more ! * my easy-to - browse categorized megacenter of information has my own description of how to use each site , and what you ll find when you get there - - and tricky tips on how to = 20 extract the best data ! you can know everything about everybody with this internet specialist collection ! * * soon to be available - - the most complete international internet spy = 20 sites available on the web today * * don t miss this one or you ll be sorry = 20 to be removed from our list please fax your address to 813-269 - 9651 . l = e3 = 01 @ u = 0b <br> spam

Subject: usage - based models - symposium <br> announcing the sixth biennial symposium of the rice university department of linguistics usage-based models of language rice university march 15-18 , 1995 invited speakers : mira ariel tel aviv university joan bybee university of new mexico john du bois university of california , santa barbara michael israel university of california , san diego sydney lamb rice university ronald langacker university of california , san diego tom givon university of oregon brian macwhinney carnegie - mellon university janet pierrehumbert northwestern university john sinclair university of birmingham ( u . k . ) arie verhagen university of utrecht description : the goal of this symposium is to explore approaches to linguistic theory that have in common the aim of accounting for linguistic usage . the empirical data for such theories is not restricted to linguistic intuitions about acceptibility , but comes from usage events of varied types . the focus is on the patterns found in the various sorts of usage data examined , and how those patterns can be extracted , represented , and used by the human mind . research from a variety of traditions will be represented , including corpus-based analyses , discourse studies , experimental studies of language processing and language acquisition , and instrumental phonetics . the approaches taken can be called data-driven , rather than model-driven , in that the fewest possible prior assumptions are made about what types of data are relevant , and that large sets of usage events are observed so that the detailed patterns found in actual usage can emerge . moreover , the various approaches taken show signs of converging toward a view of language as a dynamic system in which linguistic knowledge is not separate from its processing in language use . 
the linguistic models representing this view are usage-based by virtue of three factors : ( 1 ) the importance placed on usage data for theory construction ; ( 2 ) the direct incorporation of processing ( production and comprehension ) into linguistic theory ; and ( 3 ) the requirement that the models arrived at , whatever the direct source of evidence , must be testable with reference to language use . registration : no charge . symposium attendance on a space-available basis . for further information , contact suzanne kemmer ( kemmer @ ruf . rice . edu ) or michael barlow ( barlow @ ruf . rice . edu ) snailmail : dept . of linguistics , rice university , houston tx 77251-1892 . <br> ham

Subject: domani <br> new improved with free software , free bulk e mail system , free web site = to do what you wish , ongoing support ( optional ) , and a lot more ! all = included . . . . . . . . . . . this is a one time mailing . . . . . . . . . . . . . . . $ = you are about to make at least $ 50 , 000 in less than 90 days read the enclosed program . . . then read it again . . . / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /
