CompPhish
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/fmbs4kp9wz
Description:
About the dataset:
CompPhish is a comprehensive phishing dataset of labelled phishing and legitimate URLs, each paired with its HTML source code. A URL and its HTML file share the same serial number. The dataset contains 15,358 samples: 7,204 phishing and 8,154 legitimate.
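The class balance follows directly from the counts above. The short Python check below is only an illustration: the constant and function names are made up for this sketch and are not part of the dataset.

```python
# Sample counts as stated in the dataset description.
PHISHING = 7_204
LEGITIMATE = 8_154
TOTAL = 15_358


def class_shares(phishing: int, legitimate: int) -> dict[str, float]:
    """Return each class's fraction of the combined sample count."""
    total = phishing + legitimate
    return {"phishing": phishing / total, "legitimate": legitimate / total}


# The two class counts add up to the stated dataset size.
assert PHISHING + LEGITIMATE == TOTAL

shares = class_shares(PHISHING, LEGITIMATE)
print(f"phishing: {shares['phishing']:.1%}, legitimate: {shares['legitimate']:.1%}")
# prints "phishing: 46.9%, legitimate: 53.1%"
```

The two classes are close to balanced, so plain accuracy is a reasonable first metric, although per-class precision and recall remain informative.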
Data Collection:
Phishing URLs were collected from the PhishTank and OpenPhish repositories, and legitimate URLs from the DataForSEO Top-1000 websites list. The HTML source of each URL was downloaded with Python while the URL was still active.
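The description does not say which Python libraries or file layout the authors used. The sketch below shows one possible way to do this step with the standard library. The function names, the `<serial>.html` file naming, the serial numbers starting at 1, and the User-Agent header are all assumptions made for illustration.

```python
import http.client
from pathlib import Path
from typing import Callable, Iterable, Optional
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the page source, or None if the URL cannot be retrieved."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError, LookupError, http.client.HTTPException):
        # Covers URLError/HTTPError, timeouts, malformed URLs, unknown charsets.
        return None


def collect(
    urls: Iterable[str],
    out_dir: Path,
    fetch: Callable[[str], Optional[str]] = fetch_html,
) -> dict[int, str]:
    """Save each reachable page as <serial>.html and return serial -> URL."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: dict[int, str] = {}
    serial = 0
    for url in urls:
        html = fetch(url)
        if html is None:
            continue  # skip URLs that are no longer active
        serial += 1
        (out_dir / f"{serial}.html").write_text(html, encoding="utf-8")
        saved[serial] = url
    return saved
```

Passing the fetcher as a parameter keeps the pairing logic testable without network access. A call such as `collect(url_list, Path("html"))` would use the real downloader.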
Label Information:
Label 0 denotes a legitimate URL and label 1 a phishing URL.
Information about Features:
Seventy features are extracted from the raw URLs and their HTML source code. These features cover several types of phishing attack: URL-based phishing, brand-jacking, phishing sites hosted on compromised domains (PSHCD), and links to auto-downloadable malicious files.
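The description does not list the 70 features. To show what URL-based features usually look like, the sketch below computes a handful of common lexical ones. The feature names and the selection are hypothetical and should not be read as the dataset's actual feature set.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict[str, int]:
    """Compute a few illustrative lexical features from a raw URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = 1  # raw IP address used in place of a domain name
    except ValueError:
        host_is_ip = 0
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "host_is_ip": host_is_ip,
        "uses_https": int(parts.scheme == "https"),
        "has_at_symbol": int("@" in url),
    }


print(url_features("http://192.168.0.1/login.php"))
```

HTML-based features (for example, counts of external links or form actions) would need a second pass over the saved source file and are left out here.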
Usage:
Researchers can apply ML algorithms or feature selection techniques to the processed dataset for further analysis. The raw URLs and their HTML source code can also be used to extract novel features and to propose novel detection methodologies.
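Any such analysis normally starts with a train/test split that preserves the class ratio. The sketch below is a minimal standard-library version of that step. It assumes labels are available as a list of 0/1 values; the 80/20 ratio, the seed, and the function name are arbitrary choices for illustration, not something the dataset prescribes.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the label ratio preserved in both."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(test)


# Synthetic label vector with the class counts stated in the description.
labels = [1] * 7_204 + [0] * 8_154
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The returned index lists can then select rows from whatever feature table is loaded, so the same split can be reused across different models.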
Created:
2025-12-15



