nhagar/CC-MAIN-2022-27_urls
Source: Hugging Face. Updated: 2025-05-15. Indexed: 2025-02-15.
Download link:
https://hf-mirror.com/datasets/nhagar/CC-MAIN-2022-27_urls
Description:
This dataset contains web crawl information, URL host names, and URL counts. It has a single training split of approximately 57.48 million rows and is 2.86 GB in size. It is suited to web analysis, host name research, and URL-counting tasks.
Provided by:
nhagar
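
Usage sketch:
The description above mentions host names and URL counts, so a typical first step might be to total the counts per host. The snippet below is a minimal sketch of that, not an official loader. The column names `url_host_name` and `url_count` are assumptions and should be checked against the dataset viewer. The helper names `top_hosts` and `stream_rows` are hypothetical, and the sample rows are made up for illustration. `stream_rows` relies on the `datasets` package and network access, so it is defined here but not called.

```python
from collections import Counter
from typing import Iterable, Mapping

# Assumed column names; verify them against the dataset before relying on them.
HOST_FIELD = "url_host_name"
COUNT_FIELD = "url_count"


def top_hosts(rows: Iterable[Mapping], n: int = 3) -> list[tuple[str, int]]:
    """Sum URL counts per host name and return the n largest totals."""
    totals: Counter = Counter()
    for row in rows:
        totals[row[HOST_FIELD]] += int(row[COUNT_FIELD])
    return totals.most_common(n)


def stream_rows(repo_id: str = "nhagar/CC-MAIN-2022-27_urls"):
    """Stream the train split instead of downloading all 2.86 GB up front.

    Requires the `datasets` package and network access.
    """
    # Imported lazily so that top_hosts works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [
    {"url_host_name": "example.com", "url_count": 5},
    {"url_host_name": "example.org", "url_count": 2},
    {"url_host_name": "example.com", "url_count": 1},
]
print(top_hosts(sample, n=2))  # [('example.com', 6), ('example.org', 2)]
```

With the real data, something like `top_hosts(stream_rows(), n=10)` would iterate over all of the roughly 57.48 million rows, so expect it to take a while.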



