nhagar/CC-MAIN-2022-49_urls

Name: nhagar/CC-MAIN-2022-49_urls
Creator: nhagar
Published: 2025-05-15 04:44:05
License: No description available

Source: Hugging Face. Updated: 2025-05-15. Indexed: 2025-02-15.

Download link:

https://hf-mirror.com/datasets/nhagar/CC-MAIN-2022-49_urls

Description:

This dataset contains web crawl information. Its features are web page content (crawl), the host name of the URL (url_host_name), and the visit count of the URL (url_count). It is intended mainly for training machine learning models, such as models that analyze web page content or URL visit frequency. The training split contains more than 57 million examples.
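As a rough illustration of how the url_host_name and url_count features described above might be used, the sketch below sums counts per host and returns the largest totals. The function name `top_hosts` and the sample rows are hypothetical and invented for this example; they are not taken from the dataset, and the column types (string host name, integer count) are an assumption.

```python
from collections import Counter
from typing import Iterable, Mapping


def top_hosts(rows: Iterable[Mapping[str, object]], n: int = 3) -> list[tuple[str, int]]:
    """Sum url_count per url_host_name and return the n largest totals."""
    totals: Counter[str] = Counter()
    for row in rows:
        # Assumption: url_host_name is a string and url_count is an integer.
        totals[str(row["url_host_name"])] += int(row["url_count"])
    return totals.most_common(n)


# Hypothetical rows with the three listed columns; all values are made up.
# The "crawl" value is a placeholder and is not read by top_hosts.
sample_rows = [
    {"crawl": "placeholder", "url_host_name": "example.com", "url_count": 12},
    {"crawl": "placeholder", "url_host_name": "example.org", "url_count": 5},
    {"crawl": "placeholder", "url_host_name": "example.com", "url_count": 3},
]

print(top_hosts(sample_rows, n=2))
```

The function accepts any iterable of row dictionaries, so it could in principle be fed rows streamed with the Hugging Face `datasets` library, for example via `load_dataset("nhagar/CC-MAIN-2022-49_urls", split="train", streaming=True)`, assuming the repository is accessible and exposes a `train` split as the description suggests.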

Provider:

nhagar
