five

url phishing

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/6nhtnmn2yk
Description:
The dataset employed in this study is a large-scale, clustered phishing detection dataset designed to support machine learning (ML), deep learning (DL), and hybrid AI-based approaches for identifying phishing and malicious URLs. The dataset under consideration, referred to as the Cluster dataset, contains 147,292 samples, each corresponding to a unique URL instance. These instances comprise both malicious (phishing) and benign (legitimate) URLs collected from multiple heterogeneous sources, which provides diversity in domain structure, hosting infrastructure, and attack sophistication.

The dataset is structured for binary classification, making it suitable for supervised learning. Each sample is described by 112 numerical features derived from URL strings, domain metadata, DNS records, and network-level observations. Because all features are numeric, no categorical encoding or tokenization is required, and the data can be passed directly to a wide range of ML and DL algorithms.

Class Labels and Distribution

The target variable is denoted label and uses a binary encoding:

- Label = 1: phishing or malicious URL
- Label = 0: legitimate or benign URL

Of the 147,292 samples:

- 61,294 are malicious URLs (positive class)
- 85,998 are benign URLs (negative class)

This distribution is moderately imbalanced, with benign URLs making up about 58.4% of the samples. Such imbalance is typical of real-world cybersecurity datasets, where legitimate traffic generally exceeds malicious activity. It makes the dataset useful for evaluating classifier robustness, precision-recall trade-offs, and cost-sensitive learning strategies.
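As a rough illustration of the imbalance described above, the following sketch computes each class's share and a "balanced" class weight from the counts stated in the description. The weighting formula, total / (number of classes × class count), is one common heuristic for cost-sensitive learning and is an assumption here; the dataset description does not prescribe a weighting scheme. The names `class_summary`, `N_MALICIOUS`, and `N_BENIGN` are hypothetical.

```python
# Class counts as stated in the dataset description.
N_MALICIOUS = 61_294  # label = 1
N_BENIGN = 85_998     # label = 0


def class_summary(counts):
    """Return, per label, its share of all samples and a 'balanced' weight.

    The weight is total / (n_classes * class_count), a common heuristic
    (assumed here, not taken from the dataset description).
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {
        label: {"share": n / total, "weight": total / (n_classes * n)}
        for label, n in counts.items()
    }


summary = class_summary({1: N_MALICIOUS, 0: N_BENIGN})
for label, stats in summary.items():
    print(f"label={label}: share={stats['share']:.3f}, weight={stats['weight']:.3f}")
```

Under this heuristic the minority (malicious) class receives a weight above 1 and the majority (benign) class a weight below 1, so errors on phishing URLs count for more during training.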
URL-Centric Feature Design

URLs remain one of the most widely exploited vectors for phishing attacks, serving as entry points for credential theft, malware delivery, and social engineering campaigns. Modern phishing URLs often employ lexical obfuscation, domain impersonation, excessive parameterization, and short-lived infrastructure to evade detection. To address these challenges, the dataset emphasizes URL-based characteristics that capture both surface-level patterns and deep structural cues associated with malicious intent. The selected features aim to balance interpretability, discriminative power, and computational efficiency, making them suitable for both traditional ML models and complex DL architectures.

Feature Composition and Categorization

The 112 features in the dataset can be broadly categorized into the following groups:

- Lexical and Character-Level Features
- Structural and Length-Based Features
- Directory and Parameter Analysis Features
- Domain and Host-Based Features
- Network and Infrastructure-Level Features
- Security and Certificate-Related Features
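To make the first few feature groups more concrete, here is a minimal sketch of how lexical, length-based, and directory/parameter features might be computed from a URL string using only the Python standard library. The function name `sketch_url_features` and every feature name in it are hypothetical; the description does not list the dataset's actual 112 columns, so these should not be read as its real features. Network, DNS, and certificate features need external lookups and are left out.

```python
from urllib.parse import urlsplit, parse_qsl


def sketch_url_features(url: str) -> dict:
    """Compute a few illustrative numeric features from a URL string.

    Feature names are hypothetical and do not correspond to the
    dataset's actual columns.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path
    return {
        # Structural and length-based
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(path),
        # Lexical and character-level
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits_in_url": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        # Directory and parameter analysis
        "num_directories": len([seg for seg in path.split("/") if seg]),
        "num_query_params": len(parse_qsl(parts.query, keep_blank_values=True)),
        # Scheme flag (a crude stand-in for security-related signals)
        "uses_https": int(parts.scheme == "https"),
    }


example_url = "http://secure-login.example.com/account/update?id=42&session=abc"
features = sketch_url_features(example_url)
for name, value in features.items():
    print(f"{name}: {value}")
```

Every value is an integer, which matches the description's point that purely numeric features can be fed to a model without further encoding.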
Created:
2026-02-15