SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection
NIAID Data Ecosystem, indexed 2026-05-10
Download link:
https://data.mendeley.com/datasets/hz2d6gz7pc
Description:
Spanish is widely used in real-world phishing campaigns, yet public email corpora remain largely English-centric and rarely encode social-engineering tactics at the psychological level. Consequently, research on Spanish phishing has often been reduced to binary detection, limiting the systematic study of how manipulation is conveyed through language. SpaPhish addresses this gap by providing a Spanish-native email corpus annotated under Ana Ferreira’s Principles of Persuasion framework.
The dataset contains 1,395 emails described by 47 variables. Each record is identified by a SHA-256 hash key and includes the subject, body, and a date field. The date is parseable for 1,371 records, which span July 2014 to October 2025. A binary class label is available for all entries: 664 legitimate emails and 731 phishing emails.
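The counts above can be sanity-checked once the CSV is loaded. The following is a minimal sketch, not the authors' validation code. It assumes the rows arrive as dictionaries (for example from csv.DictReader), that the class and date columns are named "label" and "date" (hypothetical names; the real ones are listed in dataset_schema.json), and that dates follow the usual email header format.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

def summarize(rows, label_col="label", date_col="date"):
    """Count class labels and parseable dates in a single pass over dict rows.

    The column names are assumptions; adjust them to the published schema.
    """
    labels = Counter()
    parseable = 0
    for row in rows:
        labels[row[label_col]] += 1
        try:
            # Parses RFC 2822 style dates such as "Mon, 14 Jul 2014 10:00:00 +0000".
            parsedate_to_datetime(row[date_col])
            parseable += 1
        except (TypeError, ValueError):
            # Unparseable or empty date: count the record but not the date.
            continue
    return labels, parseable
```

On the full file, the label counter would be expected to sum to 1,395 and the parseable count to be 1,371 if the assumed date format matches the one used in the dataset.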
SpaPhish is structured as a multi-layer resource. Its technical layer includes derived attributes such as URL statistics, routing depth, and attachment metadata. Link-bearing content appears in 86.02% of messages. At the class level, legitimate emails show higher mean values for both url_count (8.47 vs. 4.94) and attachments_count (0.715 vs. 0.033) than phishing emails.
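To illustrate what a derived attribute such as url_count might involve, here is a rough sketch. The regular expression and both function names are assumptions made for illustration; the dataset's actual extraction logic is documented in SpaPhish_Dataset_Schema.pdf and may differ.

```python
import re

# Simplified pattern: matches http/https links up to whitespace or a quote/bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def count_urls(body):
    """Return the number of http(s) links found in an email body."""
    return len(URL_PATTERN.findall(body or ""))

def link_bearing_share(bodies):
    """Return the fraction of bodies that contain at least one link."""
    bodies = list(bodies)
    if not bodies:
        return 0.0
    return sum(count_urls(b) > 0 for b in bodies) / len(bodies)
```

Applied per class, the same counting function could be averaged to compare legitimate and phishing emails, as in the means reported above.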
A central contribution of the dataset is its psychological annotation layer. Three independent annotators labeled each message across five persuasion dimensions: authority, social_proof, liking_similarity_deception, commitment_integrity_reciprocation, and distraction. The dataset preserves the individual annotator decisions through per-annotator columns (*_A, *_B, *_C) and associated justification fields (justif_*), while also providing consolidated consensus labels for benchmarking and inter-annotator agreement analysis. In cases of complete disagreement, a fourth expert adjudicator resolved the final label.
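The consolidation step described above can be sketched as a simple majority vote. This is an illustrative assumption, not the authors' procedure: the column names follow the *_A, *_B, *_C pattern (for example authority_A), the label values are treated as opaque, and None marks a three-way disagreement that would go to the adjudicator.

```python
from collections import Counter

DIMENSIONS = [
    "authority",
    "social_proof",
    "liking_similarity_deception",
    "commitment_integrity_reciprocation",
    "distraction",
]

def consensus(a, b, c):
    """Return the label chosen by at least two annotators, or None if all differ."""
    label, votes = Counter([a, b, c]).most_common(1)[0]
    return label if votes >= 2 else None

def consensus_row(row):
    """Derive one consensus label per dimension from the per-annotator columns.

    Assumes keys of the form "<dimension>_A", "<dimension>_B", "<dimension>_C".
    """
    return {
        dim: consensus(row[f"{dim}_A"], row[f"{dim}_B"], row[f"{dim}_C"])
        for dim in DIMENSIONS
    }
```

Comparing this output with the published consensus columns is one way to check how the released labels relate to the individual annotator decisions.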
The repository also includes supporting documentation and resources: the primary dataset file (SpaPhish dataset-DiB.csv), a machine-readable schema (dataset_schema.json), a complete variable reference with data types, descriptions, and extraction logic for all 47 variables (SpaPhish_Dataset_Schema.pdf), a data dictionary (README.txt), and an interactive HTML exploratory report (SpaPhish_html_report.zip). Processing and analysis scripts are available at: https://github.com/lbustio/spa_phish.
SpaPhish supports research on Spanish phishing detection, persuasion modeling, and annotation-driven explainability by linking technical email attributes with psychologically grounded manipulation strategies.
Created:
2026-04-17



