Navyasri17/phishing_emails-data

Name: Navyasri17/phishing_emails-data
Creator: Navyasri17
Published: 2026-04-07 07:50:06
License: No description available

Source: Hugging Face. Updated: 2026-04-07. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Navyasri17/phishing_emails-data


Description:

# 🛡️ Phishing Email Classification Dataset

This dataset is curated for fine-tuning LLMs on the task of phishing email detection. It originates from [this Kaggle dataset](https://www.kaggle.com/datasets/subhajournal/phishingemails) and has been transformed to better suit LLM-based classification tasks.

## 📦 Dataset Features

- Each row is a labeled email, with either:
  - `safe email` (label = 0)
  - `phishing email` (label = 1)
- The dataset includes metadata (sender, receiver, date, subject) and cleaned email body.
- Two main columns:
  - `Email Text`: Complete formatted text including metadata and message content.
  - `label`: Binary label indicating if the email is phishing.

## 🧠 LLM Fine-Tuning Ready

Processed using a `phishing_items.py` parser:

- Truncates or filters emails based on token limits for LLM input (between 30 and 250 tokens).
- Builds classification prompts in the format:

  ```
  Is the following email safe or phishing??
  [email content]
  Email type is: [safe email/phishing email]
  ```
- Optimized for models such as `meta-llama/Meta-Llama-3.1-8B`.

## 🧼 Preprocessing Highlights

- Removes non-informative characters (e.g., `=`, `>`, `\`) and extra whitespace.
- Tokenized with Hugging Face's `AutoTokenizer`.
- Discards overly short emails (under 120 characters or under 30 tokens).

## 🗂️ Example Usage

```python
from phishing_items import Item

item = Item(data_row)
if item.include:
    print(item.prompt)
```

## 📚 Source

- Original dataset: [Kaggle - Phishing Emails](https://www.kaggle.com/datasets/subhajournal/phishingemails)
- Transformed by: [your GitHub or Hugging Face handle]
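The `phishing_items.py` parser itself is not included in this listing, so the following is only a rough sketch of the steps the description names: character cleanup, length filtering, truncation, and prompt construction. Everything in it is an assumption made for illustration: the names `clean`, `build_prompt`, and `whitespace_tokens` are hypothetical, emails above 250 tokens are assumed to be truncated rather than dropped, the line breaks in the prompt are a guess, and whitespace splitting stands in for the real tokenizer so the example runs without downloading a model.

```python
import re
from typing import Callable, List, Optional

# Prompt pieces as quoted in the dataset description.
QUESTION = "Is the following email safe or phishing??"
PREFIX = "Email type is: "
LABEL_NAMES = {0: "safe email", 1: "phishing email"}

# Thresholds as quoted in the dataset description.
MIN_CHARS = 120
MIN_TOKENS = 30
MAX_TOKENS = 250


def whitespace_tokens(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace.

    The description says the real pipeline uses Hugging Face's
    AutoTokenizer, so actual token counts will differ.
    """
    return text.split()


def clean(text: str) -> str:
    """Drop '=', '>' and backslash characters, then collapse whitespace."""
    text = re.sub(r"[=>\\]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def build_prompt(
    email_text: str,
    label: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokens,
) -> Optional[str]:
    """Return a classification prompt, or None if the email is too short."""
    cleaned = clean(email_text)
    if len(cleaned) < MIN_CHARS:
        return None
    tokens = tokenize(cleaned)
    if len(tokens) < MIN_TOKENS:
        return None
    # Assumption: long emails are truncated to MAX_TOKENS, not discarded.
    body = " ".join(tokens[:MAX_TOKENS])
    return f"{QUESTION}\n\n{body}\n\n{PREFIX}{LABEL_NAMES[label]}"


if __name__ == "__main__":
    sample = "Dear user, your account needs verification today. " * 8
    print(build_prompt(sample, 1))
```

With a real Hugging Face tokenizer, the truncation step would more likely encode the text to token ids, slice the id list, and decode it back to a string, since joining subword tokens with spaces does not reproduce the original text.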

Provider:

Navyasri17
