url phishing
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/6nhtnmn2yk
Description:
The dataset employed in this study is a large-scale, clustered phishing detection dataset designed to support advanced machine learning (ML), deep learning (DL), and hybrid AI-based approaches for identifying phishing and malicious URLs.
The dataset, referred to as the Cluster dataset, contains 147,292 samples, each corresponding to a unique URL. The samples cover both malicious (phishing) and benign (legitimate) URLs collected from multiple heterogeneous sources, which gives diversity in domain structure, hosting infrastructure, and attack sophistication. The dataset is structured for binary classification, making it suitable for supervised learning.
Each sample is described by 112 numerical features, all of which are derived from URL strings, domain metadata, DNS records, and network-level observations. The exclusive use of numeric features eliminates the need for extensive encoding or tokenization steps, allowing direct compatibility with a wide range of ML and DL algorithms.
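Because every feature is numeric, a row can be parsed straight into floats with no encoding step. The following is a minimal sketch of that idea using only the Python standard library. It assumes a CSV export with a header row and a target column named label; the function name and the column names url_length and dot_count are hypothetical stand-ins, since this description does not list the actual file layout or feature names.

```python
import csv
import io


def split_features_and_label(csv_text, label_column="label"):
    """Parse CSV text into a float feature matrix X, an int label vector y,
    and the list of feature column names (everything except the label)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_names = [name for name in reader.fieldnames if name != label_column]
    X, y = [], []
    for row in reader:
        # All features are numeric, so a plain float() conversion is enough.
        X.append([float(row[name]) for name in feature_names])
        y.append(int(row[label_column]))
    return X, y, feature_names


# Toy sample with two hypothetical feature columns; the real dataset has 112.
sample = "url_length,dot_count,label\n54,3,1\n21,1,0\n"
X, y, names = split_features_and_label(sample)
```

For the real file, the same function could be applied to the text read from disk, with the path and column names adjusted to whatever the download actually contains.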
Class Labels and Distribution
The target variable in the dataset is named label and uses a binary encoding:
label = 1: phishing or malicious URL
label = 0: legitimate or benign URL
Out of the total 147,292 samples, the dataset includes:
61,294 malicious URLs (positive class)
85,998 benign URLs (negative class)
This distribution reflects a moderate class imbalance: about 58.4% of the samples are benign and 41.6% are malicious, a ratio of roughly 1.4:1. Imbalance in this direction is typical of real-world cybersecurity datasets, where legitimate traffic generally exceeds malicious activity. It makes the dataset useful for evaluating classifier robustness, precision–recall trade-offs, and cost-sensitive learning strategies.
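The class shares and a set of cost-sensitive weights can be derived directly from the counts above. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is one reasonable choice and not something the dataset authors prescribe. The variable names are illustrative.

```python
# Class counts as stated in the dataset description.
counts = {1: 61_294, 0: 85_998}  # 1 = malicious, 0 = benign
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" weights: the rarer class receives the larger weight.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline_accuracy = max(counts.values()) / total

for label in sorted(counts):
    print(f"label={label}: share={shares[label]:.3f}, weight={class_weights[label]:.3f}")
print(f"majority baseline accuracy: {majority_baseline_accuracy:.3f}")
```

The majority baseline is a useful reference point: a model should clearly exceed it on accuracy, and precision and recall on the malicious class are more informative than accuracy alone.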
URL-Centric Feature Design
URLs remain one of the most widely exploited vectors for phishing attacks, serving as entry points for credential theft, malware delivery, and social engineering campaigns. Modern phishing URLs often employ lexical obfuscation, domain impersonation, excessive parameterization, and short-lived infrastructure to evade detection.
To address these challenges, the dataset emphasizes URL-based characteristics that capture both surface-level patterns and deep structural cues associated with malicious intent. The selected features aim to balance interpretability, discriminative power, and computational efficiency, making them suitable for both traditional ML models and complex DL architectures.
Feature Composition and Categorization
The 112 features in the dataset can be broadly categorized into the following groups:
Lexical and Character-Level Features
Structural and Length-Based Features
Directory and Parameter Analysis Features
Domain and Host-Based Features
Network and Infrastructure-Level Features
Security and Certificate-Related Features
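To make the first few categories concrete, the sketch below computes a handful of lexical, structural, and parameter features from a raw URL string with Python's urllib.parse. The feature names and the example URL are hypothetical and are not taken from the dataset's actual 112 columns, which this description does not enumerate. Network, DNS, and certificate features are omitted because they require external lookups.

```python
from urllib.parse import parse_qsl, urlsplit


def sketch_url_features(url):
    """Return a few illustrative numeric features derived from the URL string alone."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        # Lexical and character-level
        "dot_count": url.count("."),
        "hyphen_count": url.count("-"),
        "digit_count": sum(ch.isdigit() for ch in url),
        # Structural and length-based
        "url_length": len(url),
        "host_length": len(host),
        # Directory and parameter analysis
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "query_param_count": len(parse_qsl(parts.query, keep_blank_values=True)),
        # Scheme only; this does not inspect any certificate
        "uses_https": int(parts.scheme == "https"),
    }


# Made-up URL for illustration only.
example_url = "http://login.example-bank.com/secure/update?id=42&ref=mail"
features = sketch_url_features(example_url)
```

Each value is a plain number, which matches the dataset's all-numeric design and lets such features feed directly into an ML or DL model.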
Created:
2026-02-15



