nhagar/CC-MAIN-2018-17_urls

Name: nhagar/CC-MAIN-2018-17_urls
Creator: nhagar
Published: 2025-05-15 04:35:11
License: Not specified

Source: Hugging Face. Updated: 2025-05-15. Indexed: 2025-02-15.

Download link:

https://hf-mirror.com/datasets/nhagar/CC-MAIN-2018-17_urls

Description:

This dataset contains web crawl information in three fields: crawl (crawl), URL host name (url_host_name), and URL count (url_count). It has a single split, train, which holds 99,718,873 rows and occupies 5,196,870,313 bytes (about 5.2 GB). Because train is the only split, the total dataset size is the same 5,196,870,313 bytes. The download size is 1,692,219,591 bytes (about 1.7 GB).
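Since the train split is roughly 5.2 GB, it may be more practical to stream rows than to download everything first. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the dataset is reachable on the Hub under the ID shown above. The helper names `stream_rows` and `top_hosts` are hypothetical, the field names are taken from the description, and the sample rows are made up for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, List, Mapping, Tuple

# Dataset ID as given on this page.
DATASET_ID = "nhagar/CC-MAIN-2018-17_urls"


def stream_rows(limit: int) -> Iterable[Mapping]:
    """Yield up to `limit` rows from the train split without a full download.

    Assumption: the `datasets` library is installed and the Hub is reachable.
    """
    from datasets import load_dataset  # third-party, imported lazily

    ds = load_dataset(DATASET_ID, split="train", streaming=True)
    return islice(ds, limit)


def top_hosts(rows: Iterable[Mapping], n: int = 5) -> List[Tuple[str, int]]:
    """Sum url_count per url_host_name and return the n largest totals."""
    counts: Counter = Counter()
    for row in rows:
        counts[row["url_host_name"]] += int(row["url_count"])
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up rows with the three documented fields, so this runs offline.
    sample = [
        {"crawl": "CC-MAIN-2018-17", "url_host_name": "a.example", "url_count": 3},
        {"crawl": "CC-MAIN-2018-17", "url_host_name": "b.example", "url_count": 7},
        {"crawl": "CC-MAIN-2018-17", "url_host_name": "a.example", "url_count": 2},
    ]
    print(top_hosts(sample, n=2))
    # With network access, real rows could be used instead:
    # print(top_hosts(stream_rows(10_000)))
```

Keeping the aggregation separate from the loading step means the same function works on streamed rows, on a local sample, or on rows read from another source.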

Provider:

nhagar
