Cross-model evaluation of phishing detectors against LLM-generated emails: dataset, code and results
Source: Zenodo | Updated: 2026-05-17 | Indexed: 2026-05-26
Download link:
https://zenodo.org/doi/10.5281/zenodo.20250116
Description:
This dataset accompanies the manuscript "Cross-model evaluation of phishing detectors against LLM-generated emails" by Gutierrez, Villegas-Ch and Govea (2026), submitted to Frontiers.
The repository contains the complete data, code and experimental results from the study:
- Code (Python 3.11): a 7-script pipeline for assembling the corpus, extracting 17 stylometric features, training and evaluating classifiers under intra-model, cross-model, threshold-recalibrated, cross-dataset, and aggregated-pool conditions, and generating all figures in the manuscript.
- Data: stylometric feature CSVs for a combined corpus of 9,986 phishing emails (5,000 human-written from CEAS-08, TREC-07, Nazario, Nigerian Fraud, lingspam and a fraud-labeled Enron subset; and 4,986 LLM-generated using GPT-4.1, DeepSeek 3.2 and LLaMA 3.3 70B).
- Results: per-task outputs including the 3x3 cross-model transferability matrix, its threshold-recalibrated counterpart, intra-model 5-fold cross-validation metrics, cross-dataset human verification, aggregated-pool results, and SHAP feature-importance values per LLM.
Headline findings: intra-model F1 above 0.955 with XGBoost on all three LLMs; a default-threshold cross-model transferability gap of 28.1 percentage points; a reduction of that gap to 4.0 percentage points (an 86% reduction) after recalibrating the decision threshold on a 30% slice of the target LLM's data; and an aggregated-pool detector that achieves F1 = 0.997 on each individual LLM.
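The threshold recalibration mentioned in the findings can be illustrated with a minimal sketch. This is not the repository's code: the function names (`f1_at_threshold`, `recalibrate_threshold`), the F1-maximizing grid search, and the synthetic score distributions are all assumptions made for illustration. Only the 30% calibration slice comes from the description above.

```python
import numpy as np

def f1_at_threshold(y_true, scores, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def recalibrate_threshold(y_true, scores, calib_fraction=0.30, seed=0):
    """Pick the F1-maximizing threshold on a random calibration slice.

    Hypothetical helper: returns the chosen threshold and the indices of
    the held-out remainder, which should be used for the final evaluation.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_true))
    n_calib = int(round(calib_fraction * len(y_true)))
    calib, held_out = idx[:n_calib], idx[n_calib:]
    candidates = np.linspace(0.01, 0.99, 99)
    f1s = [f1_at_threshold(y_true[calib], scores[calib], t) for t in candidates]
    return float(candidates[int(np.argmax(f1s))]), held_out

# Synthetic stand-in for a detector trained on one LLM and scored on another:
# phishing emails from the target LLM receive systematically lower scores,
# so the default 0.5 threshold misses most of them.
rng = np.random.default_rng(42)
y = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
s = np.concatenate([rng.normal(0.35, 0.08, 500), rng.normal(0.15, 0.08, 500)])
s = np.clip(s, 0.0, 1.0)

best_t, held_out = recalibrate_threshold(y, s)
f1_default = f1_at_threshold(y[held_out], s[held_out], 0.5)
f1_recal = f1_at_threshold(y[held_out], s[held_out], best_t)
print(f"threshold={best_t:.2f}  F1 default={f1_default:.3f}  F1 recalibrated={f1_recal:.3f}")
```

The classifier itself is untouched in this sketch; only the decision threshold moves, and the final F1 is computed on the 70% of target data not used for calibration. The reported 86% figure is consistent with the two gaps quoted above: (28.1 - 4.0) / 28.1 ≈ 0.86.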
Code is released under the MIT License; data under Creative Commons Attribution 4.0 International (CC BY 4.0). The dataset is intended exclusively for defensive security research and academic study. See README.md and LICENSE for details and the responsible-use statement.
Provider:
Zenodo
Created:
2026-05-17



