IRBXrocket/phishing-url

Name: IRBXrocket/phishing-url
Creator: IRBXrocket
Published: 2026-04-06 10:24:21
License: CC-BY-4.0

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/IRBXrocket/phishing-url


Resource description:

---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.parquet
  - split: test
    path: data/test.parquet
task_categories:
- text-classification
- tabular-classification
- token-classification
- text2text-generation
size_categories:
- n<1K
annotations_creators:
- found
tags:
- phishing
- url
- security
language:
- en
pretty_name: TabNetone
---

# Dataset Description

The dataset contains **11430** URLs with **87** extracted features. It is designed to be used as a benchmark for machine-learning-based **phishing detection** systems. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs.

Features come from three different classes:

- **56** extracted from the structure and syntax of URLs
- **24** extracted from the content of their corresponding pages
- **7** extracted by querying external services

The dataset was partitioned randomly into training and testing sets, with a ratio of **two-thirds for training** and **one-third for testing**.

## Details

- **Funded by:** Abdelhakim Hannousse, Salima Yahiouche
- **Shared by:** [pirocheto](https://github.com/pirocheto)
- **License:** [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
- **Paper:** [https://arxiv.org/abs/2010.12847](https://arxiv.org/abs/2010.12847)

## Source Data

The diagram below illustrates the procedure for creating the corpus. For details, please refer to the paper.

<div align="center">
  <img src="images/source_data.png" alt="Diagram source data">
</div>
<p align="center">
  <em>Source: Extracted from the <a href="https://arxiv.org/abs/2010.12847">paper</a></em>
</p>

## Load Dataset

- With **datasets**:

```python
from datasets import load_dataset

dataset = load_dataset("pirocheto/phishing-url")
```

- With **pandas** and **huggingface_hub**:

```python
import pandas as pd
from huggingface_hub import hf_hub_download

REPO_ID = "pirocheto/phishing-url"
FILENAME = "data/train.parquet"

df = pd.read_parquet(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)
```

- With **pandas** only:

```python
import pandas as pd

url = "https://huggingface.co/datasets/pirocheto/phishing-url/resolve/main/data/train.parquet"
df = pd.read_parquet(url)
```

## Citation

To give credit to the creators of this dataset, please use the following citation in your work:

- BibTeX format

```
@article{Hannousse_2021,
  title={Towards benchmark datasets for machine learning based website phishing detection: An experimental study},
  volume={104},
  ISSN={0952-1976},
  url={http://dx.doi.org/10.1016/j.engappai.2021.104347},
  DOI={10.1016/j.engappai.2021.104347},
  journal={Engineering Applications of Artificial Intelligence},
  publisher={Elsevier BV},
  author={Hannousse, Abdelhakim and Yahiouche, Salima},
  year={2021},
  month=sep,
  pages={104347}
}
```

- APA format

```
Hannousse, A., & Yahiouche, S. (2021). Towards benchmark datasets for machine learning based website phishing detection: An experimental study. Engineering Applications of Artificial Intelligence, 104, 104347.
```
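
## Example: Sanity Checks

The description states that the 87 features fall into three groups (56, 24, 7), that the classes are balanced, and that the split is two-thirds training to one-third testing. The sketch below shows one way such claims could be checked after loading the data. It is a minimal illustration, not part of the original card: the helper names `class_proportions` and `train_fraction` are hypothetical, the labels are toy values rather than real rows, and the exact split sizes are only what a perfect 2:1 partition of 11430 rows would give (a random partition may differ slightly).

```python
from collections import Counter
from typing import Dict, Iterable

# Feature-group sizes and row count as stated in the dataset description.
FEATURE_GROUPS = {"url_structure": 56, "page_content": 24, "external_services": 7}
TOTAL_URLS = 11430


def class_proportions(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of each class label (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def train_fraction(n_train: int, n_test: int) -> float:
    """Return the fraction of all rows that fall in the training split."""
    return n_train / (n_train + n_test)


# The three feature groups should add up to the 87 features mentioned above.
total_features = sum(FEATURE_GROUPS.values())

# Split sizes that an exact 2:1 partition of 11430 rows would give.
exact_train = TOTAL_URLS * 2 // 3
exact_test = TOTAL_URLS - exact_train

# Toy labels (hypothetical, NOT real dataset rows) to show the helpers in use.
toy_train_labels = ["phishing", "legitimate", "phishing", "legitimate"]
toy_test_labels = ["legitimate", "phishing"]

toy_balance = class_proportions(toy_train_labels)
toy_fraction = train_fraction(len(toy_train_labels), len(toy_test_labels))

print(total_features, exact_train, exact_test)
print(toy_balance, round(toy_fraction, 3))
```

With the real data, the label column of a loaded DataFrame could be passed to `class_proportions` in place of the toy lists. The name of that column is not given in this card, so inspect `df.columns` first rather than assuming one.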


