ISCX-URL-2016

Source: DataCite Commons. Updated 2023-12-22. Indexed 2025-04-16.
Download link:
https://ieee-dataport.org/documents/iscx-url-2016
Description:
The Web has long been a major platform for online criminal activity, and URLs are the main vehicle in this domain. To counter this issue, the security community has focused its efforts on developing techniques, mostly for blacklisting malicious URLs.
Benign URLs: Over 35,300 benign URLs were collected from Alexa top websites. The domains were passed through a Heritrix web crawler to extract the URLs. Around half a million unique URLs were crawled initially and then filtered to remove duplicates and domain-only URLs. The extracted URLs were then checked with VirusTotal to keep only the benign ones.
Spam URLs: Around 12,000 spam URLs were collected from the publicly available WEBSPAM-UK2007 dataset.
Phishing URLs: Around 10,000 phishing URLs were taken from OpenPhish, a repository of active phishing sites.
Malware URLs: More than 11,500 URLs related to malware websites were obtained from DNS-BH, a project that maintains a list of malware sites.
Defacement URLs: More than 45,450 URLs belong to the defacement category. These are URLs on Alexa-ranked trusted websites that host fraudulent or hidden URLs containing malicious web pages.
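The benign-URL cleanup step described above (removing duplicates and domain-only URLs) can be sketched as follows. This is a minimal illustration, not the dataset authors' actual pipeline: the function names `is_domain_only` and `filter_crawled_urls` are hypothetical, and treating a URL as "domain only" when it has no path, query, or fragment is an assumption about what the description means.

```python
from urllib.parse import urlsplit


def is_domain_only(url: str) -> bool:
    """Return True when the URL names only a host (assumed meaning of 'domain only')."""
    parts = urlsplit(url)
    # An empty path or a bare "/" with no query or fragment carries no page information.
    return parts.path in ("", "/") and not parts.query and not parts.fragment


def filter_crawled_urls(urls):
    """Drop exact duplicates and domain-only URLs, keeping first-seen order."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue  # skip blank lines and exact duplicates
        seen.add(url)
        if is_domain_only(url):
            continue  # skip URLs that carry no path, query, or fragment
        kept.append(url)
    return kept


# Hypothetical sample input using reserved example domains.
sample = [
    "http://example.com/",
    "http://example.com/a?b=1",
    "http://example.com/a?b=1",
    "http://example.org",
    "http://example.org/page",
]
filtered = filter_crawled_urls(sample)
```

Exact string matching is the simplest notion of a duplicate; a real pipeline might also normalize case, trailing slashes, or tracking parameters before comparing.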
Provider:
IEEE DataPort
Created:
2023-12-22
Dataset introduction
Background overview
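ISCX-URL-2016 is a dataset of URLs labeled by category, intended for machine learning algorithms that predict malicious URLs. It covers benign, spam, phishing, malware, and defacement URLs, and the data is provided in CSV format.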
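As a rough illustration of working with labeled CSV data of this kind, the sketch below tallies how many rows fall into each class. The column names `url` and `label` and the sample rows are assumptions made for the example; the actual column layout of the ISCX-URL-2016 files is not stated here and should be checked against the downloaded data.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="label"):
    """Count rows per class label in a CSV file object (column name is assumed)."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


# Hypothetical in-memory sample standing in for a real dataset file.
sample = io.StringIO(
    "url,label\n"
    "http://example.com/index.html,benign\n"
    "http://example.net/login.php,phishing\n"
    "http://example.org/about,benign\n"
)
counts = count_labels(sample)
print(counts.most_common())
```

Checking the per-class counts before training is a quick way to see how imbalanced the categories are, which matters here because the defacement and benign classes are several times larger than the others.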
The summary above was collected and generated by 遇见数据集.