A Comprehensive Dataset for Webpage Classification (Part 2: Benign 1)

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://zenodo.org/record/10795434


Description:

This dataset is a resource for developing and evaluating webpage classification techniques. It contains 1,069,715 URLs, each labeled as Malicious, Benign, or Adult content and further categorized into 20 detailed sublabels for finer-grained analysis. It is designed for evaluating and benchmarking machine learning models, notably Stochastic Gradient Descent (SGD) and Support Vector Classifier (SVC), across a variety of tokenization methods and input types: URLs, raw HTML, and parsed HTML content.

The primary objective is to support research into effective webpage classification, and thereby to improve content prioritization and filtering in web crawling applications. The dataset was curated to support the study of how different feature representation techniques affect classification accuracy.

The data is stored as JSON Lines (jsonl) files, with each entry giving a URL's label, sublabel, source, status code, and HTML content.

Because of size constraints on Zenodo, the dataset is split into three parts by content category:

Part 1: Adult & Malicious contains the URLs classified as Adult or Malicious, that is, content that requires stringent filtering.
Part 2: Benign 1 and Part 3: Benign 2 contain the benign URLs, for studying safe web content and the nuances of classifying it.
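As a minimal sketch of how the JSON Lines files described above might be consumed, the snippet below streams records one line at a time and tallies (label, sublabel) pairs, so the large HTML bodies never need to be held in memory together. The helper names `iter_records` and `count_labels` are hypothetical, and the sample lines (including the sublabel value "news" and the source value "demo") are invented for illustration; only the key names follow the dataset's documented line format.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSON line, skipping blank or unparseable lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: a damaged line is skipped rather than aborting the scan.
            continue
        if isinstance(record, dict):
            yield record


def count_labels(lines: Iterable[str]) -> Counter:
    """Count (label, sublabel) pairs across all records."""
    counts: Counter = Counter()
    for record in iter_records(lines):
        counts[(record.get("label"), record.get("sublabel"))] += 1
    return counts


# Invented sample lines in the documented shape; the values are illustrative only.
sample_lines = [
    '{"url": "http://example.com/", "label": "Benign", "sublabel": "news", '
    '"source": "demo", "status_code": 200, "html": "<html></html>"}',
    "",
    "not valid json",
    '{"url": "http://example.org/", "label": "Benign", "sublabel": "news", '
    '"source": "demo", "status_code": 200, "html": "<p>hi</p>"}',
]
label_counts = count_labels(sample_lines)
```

Because an open text file is itself an iterable of lines, the same function could be applied to a downloaded part by passing the file object, for example `count_labels(open(path, encoding="utf-8"))`, where `path` is wherever the jsonl file was saved.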
We also provide a .csv file without the HTML content, which makes it easier to work with URLs only. This file contains the following columns: `['uid', 'url', 'label', 'sublabel']`.

By providing this dataset, we aim to give researchers and practitioners a resource for advancing web crawling technology and its applications.

JSON line format for each line:

{"url": "", "label": "", "sublabel": "", "source": "", "status_code": , "html": ""}

Other parts of this dataset:

A Comprehensive Dataset for Webpage Classification (Part 1: Adult & Malicious)
A Comprehensive Dataset for Webpage Classification (Part 3: Benign 2)

Citation: if you use this dataset, please cite us:

Al-Maamari, M., Istaiti, M., Zerhoudi, S., Dinzinger, M., Granitzer, M. and Mitrovic, J. A Comprehensive Dataset for Webpage Classification. https://ca-roll.github.io/downloads/A_Comprehensive_Dataset_for_Webpage_Classification.pdf

Granitzer, M., Voigt, S., Fathima, N.A., Golasowski, M., Guetl, C., Hecking, T., Hendriksen, G., Hiemstra, D., Martinovič, J., Mitrović, J. and Mlakar, I., 2023. Impact and development of an Open Web Index for open web search. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24818
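For URL-only work, the .csv file described above can be read with Python's standard `csv` module. The sketch below assumes the file starts with a header row naming the four listed columns, which the description does not state explicitly; the helper `urls_by_label` and the sample rows are hypothetical.

```python
import csv
import io
from typing import List, TextIO

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["uid", "url", "label", "sublabel"]


def urls_by_label(csv_file: TextIO, wanted_label: str) -> List[str]:
    """Return the URLs whose 'label' column equals wanted_label."""
    reader = csv.DictReader(csv_file)
    # Assumption: the first row is a header containing the expected columns.
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return [row["url"] for row in reader if row["label"] == wanted_label]


# Invented rows for illustration; real uids and sublabel values may look different.
sample_csv = io.StringIO(
    "uid,url,label,sublabel\n"
    "1,http://example.com/,Benign,news\n"
    "2,http://example.net/,Malicious,phishing\n"
)
benign_urls = urls_by_label(sample_csv, "Benign")
```

With the real file, the `io.StringIO` object would be replaced by a file opened with `newline=""`, as the `csv` module documentation recommends.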

Created:

2024-07-06
