Cross-model evaluation of phishing detectors against LLM-generated emails: dataset, code and results

Source: Zenodo | Updated: 2026-05-17 | Indexed: 2026-05-26
Download link:
https://zenodo.org/doi/10.5281/zenodo.20250116
Description:
This dataset accompanies the manuscript "Cross-model evaluation of phishing detectors against LLM-generated emails" by Gutierrez, Villegas-Ch and Govea (2026), submitted to Frontiers. The repository contains the complete data, code and experimental results from the study:

- Code (Python 3.11): a 7-script pipeline for assembling the corpus, extracting 17 stylometric features, training and evaluating classifiers under intra-model, cross-model, threshold-recalibrated, cross-dataset, and aggregated-pool conditions, and generating all figures in the manuscript.
- Data: stylometric feature CSVs for a combined corpus of 9,986 phishing emails: 5,000 human-written (from CEAS-08, TREC-07, Nazario, Nigerian Fraud, lingspam and a fraud-labeled Enron subset) and 4,986 LLM-generated (using GPT-4.1, DeepSeek 3.2 and LLaMA 3.3 70B).
- Results: per-task outputs including the 3x3 cross-model transferability matrix, its threshold-recalibrated counterpart, intra-model 5-fold cross-validation metrics, cross-dataset human verification, aggregated-pool results, and SHAP feature-importance values per LLM.

Key headline findings:

- Intra-model F1 is above 0.955 with XGBoost on all three LLMs.
- The default-threshold cross-model transferability gap is 28.1 percentage points.
- Recalibrating the decision threshold on a 30% slice of the target LLM reduces the gap to 4.0 percentage points (an 86% reduction).
- The aggregated-pool detector achieves F1 = 0.997 on each individual LLM.

Code is released under the MIT License; data under Creative Commons Attribution 4.0 International (CC BY 4.0). The dataset is intended exclusively for defensive security research and academic study. See README.md and LICENSE for details and the responsible-use statement.
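The threshold-recalibration idea and the 86% figure can be illustrated with a minimal sketch. This is not the repository's code: the function names, the toy scores and the choice of F1 as the selection criterion are assumptions made for illustration, since the description does not state how the recalibrated threshold is chosen. The sketch picks, on a small labeled calibration slice from the target LLM, the score cutoff that maximizes F1, and then checks the reported gap reduction arithmetically.

```python
from typing import Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for the positive class (label 1) when predicting 1 iff score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def recalibrate_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the observed score that, used as a cutoff, gives the highest F1.

    Assumption: F1 is the selection criterion; the study may use another one.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


def relative_gap_reduction(gap_before: float, gap_after: float) -> float:
    """Fraction of the original gap that was closed."""
    return (gap_before - gap_after) / gap_before


# Hypothetical calibration slice: detector scores on emails from a target LLM.
# Positives (label 1) score lower than they would for the LLM seen in training.
toy_scores = [0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60]
toy_labels = [0, 0, 0, 1, 1, 1, 1, 1]

default_f1 = f1_at_threshold(toy_scores, toy_labels, 0.5)
new_threshold = recalibrate_threshold(toy_scores, toy_labels)
recalibrated_f1 = f1_at_threshold(toy_scores, toy_labels, new_threshold)

# Gap figures taken from the description: 28.1 points before, 4.0 points after.
reduction = relative_gap_reduction(28.1, 4.0)
print(f"default F1={default_f1:.3f}, threshold={new_threshold}, "
      f"recalibrated F1={recalibrated_f1:.3f}, gap reduction={reduction:.0%}")
```

In practice the threshold would be chosen on the 30% calibration slice only and the F1 reported on the remaining 70%, so that the recalibrated score is not measured on the data used to pick the cutoff.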
Provider:
Zenodo
Created:
2026-05-17