
IRBXrocket/phishing-url

Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/IRBXrocket/phishing-url
Resource overview:
---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.parquet
  - split: test
    path: data/test.parquet
task_categories:
- text-classification
- tabular-classification
- token-classification
- text2text-generation
size_categories:
- n<1K
annotations_creators:
- found
tags:
- phishing
- url
- security
language:
- en
pretty_name: TabNetone
---

# Dataset Description

The provided dataset includes **11430** URLs with **87** extracted features.
The dataset is designed to be used as a benchmark for machine learning based **phishing detection** systems.
The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs.

Features fall into three classes:

- **56** extracted from the structure and syntax of URLs
- **24** extracted from the content of their corresponding pages
- **7** extracted by querying external services

The dataset was partitioned randomly into training and testing sets, with a ratio of **two-thirds for training** and **one-third for testing**.

## Details

- **Funded by:** Abdelhakim Hannousse, Salima Yahiouche
- **Shared by:** [pirocheto](https://github.com/pirocheto)
- **License:** [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
- **Paper:** [https://arxiv.org/abs/2010.12847](https://arxiv.org/abs/2010.12847)

## Source Data

The diagram below illustrates the procedure for creating the corpus.
For details, please refer to the paper.

<div align="center">
  <img src="images/source_data.png" alt="Diagram source data">
</div>
<p align="center">
  <em>Source: Extracted from the <a href="https://arxiv.org/abs/2010.12847">paper</a></em>
</p>

## Load Dataset

- With **datasets**:

```python
from datasets import load_dataset

# Loads every configured split (train and test)
dataset = load_dataset("pirocheto/phishing-url")
```

- With **pandas** and **huggingface_hub**:

```python
import pandas as pd
from huggingface_hub import hf_hub_download

REPO_ID = "pirocheto/phishing-url"
FILENAME = "data/train.parquet"

# Download the parquet file into the local cache, then read it
df = pd.read_parquet(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)
```

- With **pandas** only:

```python
import pandas as pd

# Read the training split directly from its URL
url = "https://huggingface.co/datasets/pirocheto/phishing-url/resolve/main/data/train.parquet"
df = pd.read_parquet(url)
```

## Citation

To give credit to the creators of this dataset, please use the following citation in your work:

- BibTeX format

```
@article{Hannousse_2021,
  title={Towards benchmark datasets for machine learning based website phishing detection: An experimental study},
  volume={104},
  ISSN={0952-1976},
  url={http://dx.doi.org/10.1016/j.engappai.2021.104347},
  DOI={10.1016/j.engappai.2021.104347},
  journal={Engineering Applications of Artificial Intelligence},
  publisher={Elsevier BV},
  author={Hannousse, Abdelhakim and Yahiouche, Salima},
  year={2021},
  month=sep,
  pages={104347}
}
```

- APA format

```
Hannousse, A., & Yahiouche, S. (2021). Towards benchmark datasets for machine learning based website phishing detection: An experimental study. Engineering Applications of Artificial Intelligence, 104, 104347.
```
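The figures quoted in the dataset description (87 features in three groups, a two-thirds/one-third split of 11430 URLs, and a 50/50 class balance) can be sanity-checked with a few lines of standard-library Python. The sketch below is only illustrative: the names `FEATURE_GROUPS`, `expected_split_sizes`, and `class_balance` are hypothetical helpers and not part of the dataset or any library, and the split sizes it prints assume an exact two-thirds ratio, which the published split may only approximate.

```python
from collections import Counter
from fractions import Fraction

# Figures quoted in the dataset description.
N_URLS = 11430
FEATURE_GROUPS = {
    "url_structure": 56,      # structure and syntax of URLs
    "page_content": 24,       # content of the corresponding pages
    "external_services": 7,   # queries to external services
}


def expected_split_sizes(n_rows, train_fraction=Fraction(2, 3)):
    """Return (train, test) row counts for an exact split at train_fraction."""
    n_train = int(n_rows * train_fraction)
    return n_train, n_rows - n_train


def class_balance(labels):
    """Return the share of each label value in an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


print(sum(FEATURE_GROUPS.values()))      # 87
print(expected_split_sizes(N_URLS))      # (7620, 3810)
print(class_balance(["phishing", "legitimate", "legitimate", "phishing"]))
```

With a DataFrame loaded as in the Load Dataset section, something like `class_balance(df["status"])` should report the class shares. The column name `status` is an assumption here; check `df.columns` for the actual label column before relying on it.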

Provider:
IRBXrocket