qgyd2021/spam_detect | Spam Detection Dataset
Spam Detection Dataset Overview
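Dataset Overview
This dataset targets spam detection, covering spam email, spam SMS, and advertisement identification. It contains samples across several languages and task types, predominantly English and Chinese.
Data Sources
The dataset was collected and consolidated from multiple public resources, listed below: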
Dataset | Language | Task | Original data / project | Samples | Original description | Alternative downloads |
---|---|---|---|---|---|---|
enron_spam | English | Spam email classification | enron_spam_data; Enron-Spam; spam-mails-dataset | ham: 16545; spam: 17171 | The Enron-Spam dataset is an excellent resource collected by V. Metsis, I. Androutsopoulos, and G. Paliouras. | SetFit/enron_spam; enron-spam |
enron_spam_subset | English | Spam email classification | email-spam-dataset | ham: 5000; spam: 5000 | | |
ling_spam | English | Spam email classification | lingspam-dataset; email-spam-dataset | ham: 2172; spam: 433 | The Ling-Spam dataset is a collection of 2,893 spam and non-spam messages curated from the Linguist List. | |
sms_spam | English | Spam SMS classification | SMS Spam Collection; SMS Spam Collection Dataset | ham: 4827; spam: 747 | The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research. | sms_spam |
sms_spam_collection | English | Spam SMS classification | spam-emails | ham: 4825; spam: 747 | This dataset contains a collection of emails. | email-spam-detection-dataset-classification; spam-identification; sms-spam-collection; spam-or-ham |
spam_assassin | English | Spam email classification | datasets-spam-assassin; Apache SpamAssassin’s public datasets; Spam or Not Spam Dataset | ham: 4150; spam: 1896 | Derived from the completeSpamAssassin.csv file in email-spam-dataset. | email-spam-dataset; talby/SpamAssassin; spamassassin-2002 |
spam_base | English | Spam email classification | spambase | | Classifies emails as spam or non-spam. | spam-email-data-uci |
spam_detection | English | Spam SMS classification | Deysi/spam-detection-dataset | ham: 5400; spam: 5500 | | |
spam_message | Chinese | Spam SMS classification | SpamMessage | ham: 720000; spam: 80000 | The spam samples are genuine but have been anonymized (for example 招生电话:xxxxxxxxxxx, "admissions hotline: xxxxxxxxxxx"), so the x characters may become a salient feature. The ham samples look like fragments truncated from ordinary text and used as filler; using this data is not recommended. | |
spam_message_lr | Chinese | Spam SMS classification | SpamMessagesLR | ham: 3983; spam: 6990 | | |
trec07p | English | Spam email classification | 2007 TREC Public Spam Corpus; Spam Track | ham: 25220; spam: 50199 | 2007 TREC Public Spam Corpus | trec07p.tar.gz |
trec06c | Chinese | Spam email classification | 2006 TREC Public Spam Corpora | | 2006 TREC Public Spam Corpora | |
youtube_spam_collection | English | Spam comment classification | youtube+spam+collection; YouTube Spam Collection Data Set | ham: 951; spam: 1005 | A public set of comments collected for spam research. | |
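The ham/spam counts in the table differ widely in class balance, which matters when subsets are mixed for training. The following is a minimal sketch, not part of the original project: it parses a few sample-count cells copied from the table and computes the spam fraction of each subset. The helper names `parse_counts` and `spam_ratio` are hypothetical and introduced here only for illustration.

```python
from __future__ import annotations

import re

# Sample-count cells copied verbatim from the table above (a subset of rows).
COUNTS = {
    "enron_spam": "ham: 16545; spam: 17171",
    "ling_spam": "ham: 2172; spam: 433",
    "sms_spam": "ham: 4827; spam: 747",
    "spam_message": "ham: 720000; spam: 80000",
    "spam_message_lr": "ham: 3983; spam: 6990",
    "trec07p": "ham: 25220; spam: 50199",
}


def parse_counts(cell: str) -> dict[str, int]:
    """Parse a cell such as 'ham: 4827; spam: 747' into {'ham': 4827, 'spam': 747}."""
    return {label: int(n) for label, n in re.findall(r"(\w+):\s*(\d+)", cell)}


def spam_ratio(cell: str) -> float:
    """Return the fraction of samples labeled spam for one table cell."""
    counts = parse_counts(cell)
    return counts["spam"] / (counts["ham"] + counts["spam"])


if __name__ == "__main__":
    for name, cell in COUNTS.items():
        print(f"{name:18s} spam ratio = {spam_ratio(cell):.3f}")
```

Subsets such as spam_message (one spam in ten) and trec07p (roughly two spam in three) sit at opposite ends, so a combined training set may need resampling or per-subset weighting.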
Sample Examples
enron_spam sample examples
wanted to try ci 4 lis but thought it was way too expensive for you ? <br> viagra at $ 1 . 12 per dose ready to boost your sex life ? positive ? time to do it right now . order viagra at incredibly low prices $ 1 . 12 per dose . unbelivable remove <br> spam
enron / hpl actuals for december 11 , 2000 <br> teco tap 30 . 000 / enron ; 120 . 000 / hpl gas daily ls hpl lsk ic 30 . 000 / enron ham
looking for cheap high - quality software ? rotated napoleonizes <br> water past also , burn , course . gave country , mass lot . act north good . from , learn form most brother vary . when more for . up stick , century put , song be . test , describe , plain , against wood star . began dress ever group . here oh , most world stay . <br> spam
ideabank website <br> please read the attached document for information about an exciting new website for ets employees ! ham
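In the flattened samples above, each row ends with its label (`ham` or `spam`), and line breaks inside the message appear as literal `<br>` markers. The following is a rough sketch of how such a row could be split back into text and label. It assumes the label is always the last whitespace-separated token, which holds for the rows shown here but is not guaranteed by the source. `split_sample` is a hypothetical helper, not part of the dataset's tooling.

```python
from __future__ import annotations


def split_sample(line: str) -> tuple[str, str]:
    """Split a flattened 'text ... label' row into (text, label).

    Assumption: the final space-separated token is the label, 'ham' or 'spam'.
    """
    text, _, label = line.rstrip().rpartition(" ")
    if label not in {"ham", "spam"}:
        raise ValueError(f"no trailing ham/spam label in: {line[-40:]!r}")
    # Turn the literal <br> markers back into newlines and trim outer whitespace.
    text = text.replace("<br>", "\n").strip()
    return text, label


if __name__ == "__main__":
    row = ("ideabank website <br> please read the attached document for "
           "information about an exciting new website for ets employees ! ham")
    text, label = split_sample(row)
    print(label)
    print(text)
```

A row whose message body is cut off before the label, as can happen in truncated page extracts, raises `ValueError` rather than being silently mislabeled.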
enron_spam_subset sample examples
Subject: edrugs online viagra - proven step to start something all over again . nothing is more useful than silence . teachers open the door . you enter by yourself . how sharper than a serpent s tooth it isto have a thankless child ! spam
Subject: start date : 12 / 13 / 01 ; hourahead hour : 5 ; start date : 12 / 13 / 01 ; hourahead hour : 5 ; no ancillary schedules awarded . no variances detected . log messages : parsing file - - > > o : portland westdesk california scheduling iso final schedules 2001121305 . txt ham
Subject: cheapestt medss ! mediccationns at lowesst pricess everyy ! over 80 . % offf , pricess wontt get lowerr we selll vic od ( in v , ia . gra x , ana . x http : / / www . pr 3 sdlugs . com / ? refid = 87 <br> spam
Subject: fw : picture
the following is an aerial photo of the wtc area . it kinda brings on vertigo , but is a phenomenal shot .
http : / / userwww . service . emory . edu / ~ rdgarr / wtc . htm ham
ling_spam sample examples
Subject: internet specialist 007 - the spy <br> internet specialist 007 - the spy learn everything about your friends , neighbors , enemies , employees or anyone else ! - - even your boss ! - - even yourself ! this mammoth snoop collection of internet sites will provide you the newest and most current addresses available on the net today . . . = 20 * track down an old debt , or anyone else that has done you wrong ! it s incredible , and so many new data sites have come online in the past 90 days . . . * over 300 giant resources to look up people , credit , social security , current or past employment , mail order purchases , = 20 addresses , phone numbers , maps to city locations . . . * investigate your family history ! check birth , death , adoption or social security records check service records or army , navy , air force or = 20 marine corps . * locate an old friend ( or an enemy who is hiding ) or a lost = 20 love - - find e-mail , telephone or address information on anyone ! = 20 even look up * unlisted * phone numbers ! * find work by searching classified ads all over the world ! * screen prospective employees - - check credit , driving or criminal records verify income or educational accomplishments = 20 * check out your daughter s new boyfriend ! * find trial transcripts and court orders ! * enjoy the enchantment of finding out a juicy tid-bit about a co-worker . the internet is a powerful megasource of information , = 20 if you only know where to look . i tell you how to find = 20 out nearly anything about anybody , and tell you exactly where to find it ! you will be amazed to find out what personal information = 20 other people can find out about you ! check your credit = 20 report so you can correct wrong information that may be = 20 used to deny you credit . research yourself first ! you ll be horrified , as i was , = 20 at how much data has been accumulated about you . any my huge collection is only the beginning ! 
once you = 20 locate these free private , college and government web sites , you ll find even more links to even more = 20 information search engines ! = 20 if you believe ( like i do ) that the information that is stored about each one of us should be freely accessible , you ll want to see the snoop collection i ve compiled . verify your own records , or find out what you need to = 20 know about others . i m telling you , it s incredible what you can find out using the internet ! we will accept checks by fax at 813-269 - 9651 or > > > send $ 14 . 95 cash , check or money order to : > > > the coldwell group > > > p . o . box 3787 > > > dept 1007 > > > petersburg , va 23805 i will rush back to you my snoop information for fastest service include your * e-mail * address . = 20 * what information is available - - and exact url to get there ! * exactly where to look for - - and the clever way to use - - = 20 the above search engines , and tons more ! * my easy-to - browse categorized megacenter of information has my own description of how to use each site , and what you ll find when you get there - - and tricky tips on how to = 20 extract the best data ! you can know everything about everybody with this internet specialist collection ! * * soon to be available - - the most complete international internet spy = 20 sites available on the web today * * don t miss this one or you ll be sorry = 20 to be removed from our list please fax your address to 813-269 - 9651 . l = e3 = 01 @ u = 0b <br> spam
Subject: usage - based models - symposium <br> announcing the sixth biennial symposium of the rice university department of linguistics usage-based models of language rice university march 15-18 , 1995 invited speakers : mira ariel tel aviv university joan bybee university of new mexico john du bois university of california , santa barbara michael israel university of california , san diego sydney lamb rice university ronald langacker university of california , san diego tom givon university of oregon brian macwhinney carnegie - mellon university janet pierrehumbert northwestern university john sinclair university of birmingham ( u . k . ) arie verhagen university of utrecht description : the goal of this symposium is to explore approaches to linguistic theory that have in common the aim of accounting for linguistic usage . the empirical data for such theories is not restricted to linguistic intuitions about acceptibility , but comes from usage events of varied types . the focus is on the patterns found in the various sorts of usage data examined , and how those patterns can be extracted , represented , and used by the human mind . research from a variety of traditions will be represented , including corpus-based analyses , discourse studies , experimental studies of language processing and language acquisition , and instrumental phonetics . the approaches taken can be called data-driven , rather than model-driven , in that the fewest possible prior assumptions are made about what types of data are relevant , and that large sets of usage events are observed so that the detailed patterns found in actual usage can emerge . moreover , the various approaches taken show signs of converging toward a view of language as a dynamic system in which linguistic knowledge is not separate from its processing in language use . 
the linguistic models representing this view are usage-based by virtue of three factors : ( 1 ) the importance placed on usage data for theory construction ; ( 2 ) the direct incorporation of processing ( production and comprehension ) into linguistic theory ; and ( 3 ) the requirement that the models arrived at , whatever the direct source of evidence , must be testable with reference to language use . registration : no charge . symposium attendance on a space-available basis . for further information , contact suzanne kemmer ( kemmer @ ruf . rice . edu ) or michael barlow ( barlow @ ruf . rice . edu ) snailmail : dept . of linguistics , rice university , houston tx 77251-1892 . <br> ham
Subject: domani <br> new improved with free software , free bulk e mail system , free web site = to do what you wish , ongoing support ( optional ) , and a lot more ! all = included . . . . . . . . . . . this is a one time mailing . . . . . . . . . . . . . . . $ = you are about to make at least $ 50 , 000 in less than 90 days read the enclosed program . . . then read it again . . . / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /