PhishLegitURLs: A Comprehensive Dataset of Legitimate and Phishing URLs

Indexed from NIAID Data Ecosystem on 2026-05-02

Download link:

https://data.mendeley.com/datasets/j43jtv3zzc

Description:

This dataset contains approximately 1,000 URLs, about 700 phishing and about 300 legitimate, intended for research and development of phishing detection models. It is structured as follows.

Phishing URLs (approx. 700): These URLs were sourced from the URLHaus database, described as a well-known repository of malicious websites used in phishing attacks. Each entry in this subset has been manually verified and labeled as a phishing URL.

Legitimate URLs (approx. 300): These URLs were collected from reputable sources such as Wikipedia and Stack Overflow, sites known for hosting user-generated content and community discussions. The URLs were scraped at random to diversify the types of legitimate pages included.

Dataset features:

URL: The full web address of each entry, which is the primary feature for analysis.

Label: A binary label indicating whether the URL is legitimate (1) or phishing (0).

Applications: The dataset is suitable for training and evaluating machine learning models that distinguish phishing from legitimate websites. It can support cybersecurity research projects including URL-based phishing detection, web content analysis, and the development of real-time protection systems.

Usage: Researchers can use the dataset to develop and test phishing detection algorithms, deriving input features from URL structure and using the Label column as the target. Because the split is roughly 70/30 rather than even, class imbalance should be taken into account when reporting accuracy.
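
As a minimal sketch of how URL-structure features might be derived from the URL column, the Python below uses only the standard library. The label convention (1 = legitimate, 0 = phishing) comes from the description above. The function names, the choice of features, and the three sample rows are illustrative assumptions, not part of the dataset or its authors' method.

```python
from collections import Counter
from urllib.parse import urlparse
import ipaddress

# Label convention as stated in the dataset description.
LEGITIMATE, PHISHING = 1, 0


def is_ip_host(host: str) -> bool:
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Hypothetical helper: derive simple structural features from one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "length": len(url),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "uses_https": parsed.scheme == "https",
        "host_is_ip": is_ip_host(host),
        # Count non-empty path segments, e.g. /login/update.php -> 2.
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }


def class_balance(rows) -> dict:
    """Hypothetical helper: fraction of rows per label."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Made-up sample rows in (URL, Label) form; these are not taken from the dataset.
rows = [
    ("https://en.wikipedia.org/wiki/Phishing", LEGITIMATE),
    ("http://192.0.2.10/login/update.php", PHISHING),
    ("http://secure-login.example.com/account/verify", PHISHING),
]

if __name__ == "__main__":
    for url, label in rows:
        print(label, extract_url_features(url))
    print(class_balance(rows))
```

Checking the class balance first is worthwhile here: with roughly 700 phishing and 300 legitimate URLs, a model that always predicts phishing would already reach about 70% accuracy, so precision and recall per class are likely more informative than accuracy alone.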

Created:

2024-10-01
