nhagar/CC-MAIN-2017-34_urls

Name: nhagar/CC-MAIN-2017-34_urls
Creator: nhagar
Published: 2025-05-15 04:24:03
License: No description available

Hugging Face, updated 2025-05-15, indexed 2025-02-15

Download link:

https://hf-mirror.com/datasets/nhagar/CC-MAIN-2017-34_urls


Description:

This dataset contains web crawl information in three fields: page content (crawl), page hostname (url_host_name), and page count (url_count). It ships as a single training split with a large number of samples and can be used for tasks such as page content analysis and hostname analysis.
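As a rough illustration of the hostname analysis mentioned above, the following minimal sketch sums url_count per url_host_name and returns the largest hosts. The function name `top_hosts` and the sample rows are hypothetical and invented for this example; only the three field names come from the description.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def top_hosts(rows: Iterable[Mapping], n: int = 3) -> List[Tuple[str, int]]:
    """Sum url_count per url_host_name and return the n hosts with the largest totals."""
    totals: Counter = Counter()
    for row in rows:
        totals[row["url_host_name"]] += int(row["url_count"])
    return totals.most_common(n)


# Hypothetical rows that mimic the three described fields; the values are invented.
sample_rows = [
    {"crawl": "placeholder", "url_host_name": "example.com", "url_count": 120},
    {"crawl": "placeholder", "url_host_name": "example.org", "url_count": 45},
    {"crawl": "placeholder", "url_host_name": "example.com", "url_count": 30},
]

print(top_hosts(sample_rows, n=2))
```

With the Hugging Face `datasets` library installed, the same function should also accept rows streamed via `load_dataset("nhagar/CC-MAIN-2017-34_urls", split="train", streaming=True)`, assuming the repository is reachable and the split is named `train` as the description suggests.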

Provider:

nhagar
