
Navyasri17/phishing_emails-data

Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Navyasri17/phishing_emails-data
Description:
# 🛡️ Phishing Email Classification Dataset

This dataset is curated for fine-tuning LLMs on the task of phishing email detection. It originates from [this Kaggle dataset](https://www.kaggle.com/datasets/subhajournal/phishingemails) and has been transformed to better suit LLM-based classification tasks.

## 📦 Dataset Features

- Each row is a labeled email, with either:
  - `safe email` (label = 0)
  - `phishing email` (label = 1)
- The dataset includes metadata (sender, receiver, date, subject) and the cleaned email body.
- Two main columns:
  - `Email Text`: Complete formatted text including metadata and message content.
  - `label`: Binary label indicating whether the email is phishing.

## 🧠 LLM Fine-Tuning Ready

Processed using a `phishing_items.py` parser, which:

- Truncates or filters emails based on token limits for LLM input (between 30 and 250 tokens).
- Builds classification prompts in the format:

```
Is the following email safe or phishing??
[email content]
Email type is: [safe email/phishing email]
```

- Is optimized for models such as `meta-llama/Meta-Llama-3.1-8B`.

## 🧼 Preprocessing Highlights

- Removes non-informative characters (e.g., `=`, `>`, `\`) and extra whitespace.
- Tokenizes with Hugging Face's `AutoTokenizer`.
- Discards overly short emails (under 120 characters or under 30 tokens).

## 🗂️ Example Usage

```python
from phishing_items import Item

# data_row is one record from the dataset; it is not defined in this snippet.
item = Item(data_row)

# `include` appears to flag emails that passed the length filters.
if item.include:
    print(item.prompt)
```

## 📚 Source

- Original dataset: [Kaggle - Phishing Emails](https://www.kaggle.com/datasets/subhajournal/phishingemails)
- Transformed by: [your GitHub or Hugging Face handle]
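
The filtering and prompt-building steps listed under "LLM Fine-Tuning Ready" and "Preprocessing Highlights" could be sketched roughly as follows. This is not the actual `phishing_items.py`, whose source is not shown here. The function names `scrub` and `build_prompt`, the constant names, and the single-newline prompt layout are all hypothetical. To stay dependency-free, the sketch also counts whitespace-separated words in place of `AutoTokenizer` tokens, so its counts will differ from the real tokenizer's.

```python
import re

# Thresholds taken from the dataset description.
MIN_TOKENS = 30
MAX_TOKENS = 250
MIN_CHARS = 120

QUESTION = "Is the following email safe or phishing??"
PREFIX = "Email type is: "
LABEL_NAMES = {0: "safe email", 1: "phishing email"}


def scrub(text: str) -> str:
    """Drop '=', '>' and backslash characters, then collapse whitespace."""
    text = re.sub(r"[=>\\]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def build_prompt(email_text: str, label: int):
    """Return a classification prompt, or None if the email is too short."""
    cleaned = scrub(email_text)
    if len(cleaned) < MIN_CHARS:
        return None
    # Assumption: whitespace splitting stands in for a real tokenizer.
    tokens = cleaned.split()
    if len(tokens) < MIN_TOKENS:
        return None
    # Truncate long emails to the upper token bound.
    body = " ".join(tokens[:MAX_TOKENS])
    return f"{QUESTION}\n{body}\n{PREFIX}{LABEL_NAMES[label]}"
```

Calling `build_prompt(row["Email Text"], row["label"])` on each row and skipping the `None` results would yield prompts in the documented shape. Swapping the `split()` call for a real tokenizer's encode and decode steps would bring the counts in line with the model's vocabulary.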
Provided by:
Navyasri17