Replication Package of "Battling Phish"

Figshare, updated 2025-10-09, indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/Replication_Package_of_Battling_Phish_/30324559

Resource description:

This replication package contains all datasets and source code used in our study on phishing URL detection. The study investigates the effectiveness of traditional machine learning (ML), deep learning (DL), and large language model (LLM)-based methods using a consistent set of 32 URL-based features.

Directory Structure

├── Datasets/
│   ├── Dataset-1.csv
│   ├── Dataset-2.csv
│   ├── Dataset-3.csv
│   ├── Dataset-4.csv
│   ├── Dataset-5.csv
│   ├── Phishing_Site_URLs_32_Features_Extracted_Data.csv
│   └── Legit_Phish_32_Features_Extracted_Data.csv
│
└── Source_Codes/
    ├── Feature_extraction_source_code.py
    ├── Feature_importance_analysis_source_code.py
    ├── ML/
    │   ├── Seven_ML_Models_trained_on_LP.py
    │   ├── Seven_ML_Models_trained_on_PSU.py
    │   ├── SoftVoting_trained_on_LP.py
    │   ├── SoftVoting_trained_on_PSU.py
    │   ├── HardVoting_trained_on_LP.py
    │   └── HardVoting_trained_on_PSU.py
    │
    ├── DL/
    │   ├── [DLModel1]_trained_on_LP.py
    │   ├── [DLModel1]_trained_on_PSU.py
    │   └── ... (total 16 files for 8 DL algorithms)
    │
    └── LLM/
        ├── BERT_Fine_Tuned_on_LP.py
        ├── BERT_Fine_Tuned_on_PSU.py
        ├── DistilBERT_Fine_Tuned_on_LP.py
        ├── DistilBERT_Fine_Tuned_on_PSU.py
        ├── PhishBERT_Evaluation.py
        └── URLBERT_Evaluation.py

Datasets:

Dataset-1.csv to Dataset-5.csv: Used for feature importance analysis.
Phishing_Site_URLs_32_Features_Extracted_Data.csv (PSU dataset): Includes phishing and legitimate URLs with 32 extracted lexical features.
Legit_Phish_32_Features_Extracted_Data.csv (LP dataset): Another benchmark dataset with the same 32 features, used for comparative evaluation.

Note: PSU and LP datasets are used for both training and evaluating ML, DL, and LLM-based models.

Source Code:

Feature_extraction_source_code.py: Extracts 32 handcrafted lexical features from raw URL data.
Feature_importance_analysis_source_code.py: Performs feature selection using seven statistical and model-based ranking methods.

Machine Learning (ML)

Implements ML classifiers individually trained on LP and PSU datasets: Logistic Regression, Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and XGBoost. Soft Voting and Hard Voting ensembles are also implemented.

Scripts:
    Seven_ML_Models_trained_on_LP.py, Seven_ML_Models_trained_on_PSU.py
    SoftVoting_trained_on_LP.py, SoftVoting_trained_on_PSU.py
    HardVoting_trained_on_LP.py, HardVoting_trained_on_PSU.py

Deep Learning (DL)

Implements eight deep learning architectures, each trained separately on LP and PSU. Total of 16 scripts: 2 per DL model (1 for LP, 1 for PSU).

Large Language Models (LLMs)

Fine-tuned:
    BERT_Fine_Tuned_on_LP.py, BERT_Fine_Tuned_on_PSU.py
    DistilBERT_Fine_Tuned_on_LP.py, DistilBERT_Fine_Tuned_on_PSU.py

Pre-trained, zero-shot or direct evaluation:
    PhishBERT_Evaluation.py
    URLBERT_Evaluation.py
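
The description mentions both Soft Voting and Hard Voting ensembles without saying how they differ. The sketch below illustrates the general distinction only and is not taken from the SoftVoting_*.py or HardVoting_*.py scripts: hard voting takes the majority of the base models' predicted labels, while soft voting averages their predicted probabilities before thresholding. The function names `hard_vote` and `soft_vote`, the 0.5 threshold, the label encoding (1 = phishing, 0 = legitimate), and the seven probabilities are all hypothetical and chosen for illustration.

```python
from collections import Counter
from statistics import mean


def hard_vote(labels):
    """Return the most common class label among the base models' predictions."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(phishing_probs, threshold=0.5):
    """Average the base models' phishing probabilities, then apply a threshold."""
    return 1 if mean(phishing_probs) >= threshold else 0


# Hypothetical phishing probabilities from seven base classifiers for one URL
# (assumed encoding: 1 = phishing, 0 = legitimate).
probs = [0.95, 0.90, 0.45, 0.40, 0.45, 0.40, 0.45]

# Each model's own label, using the same assumed 0.5 threshold.
labels = [1 if p >= 0.5 else 0 for p in probs]

hard = hard_vote(labels)  # five of seven models say "legitimate"
soft = soft_vote(probs)   # two very confident models pull the mean above 0.5

print("hard voting:", hard)
print("soft voting:", soft)
```

In this made-up case the two rules disagree: the majority label is 0, while the averaged probability (about 0.57) crosses the threshold and yields 1. That sensitivity to model confidence is the usual reason for evaluating both ensemble types side by side.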

Created:

2025-10-09
