Web Pages Dataset

Source: arXiv. Updated: 2021-05-15. Indexed: 2024-06-21.
Download link:
https://osf.io/7ghd2/
Description:
Web Pages Dataset is a large-scale dataset created by a research team at the Central University of Ecuador. It comprises 49,438 web pages from countries around the world, covering topics that include arts and entertainment, business and economics, education, government, news and media, and science and environment. Beyond the visual appearance of each page, captured as a screenshot (webshot), the dataset includes textual and numerical data such as qualitative and quantitative page attributes. The team automated most of the data collection, organization, and debugging with programs written in Python and R. The dataset targets tasks such as web page classification and web page quality assessment; its usefulness has been demonstrated in web page error detection and multi-class topic classification using convolutional neural networks (CNNs).
Provider:
Central University of Ecuador
Created:
2021-05-15