raileymontalan/SEA-PILE-v2-tl-tokenized-stochastok0.1
Source: Hugging Face. Updated: 2025-11-13. Indexed: 2025-11-15.
Download link:
https://hf-mirror.com/datasets/raileymontalan/SEA-PILE-v2-tl-tokenized-stochastok0.1
Description:
This dataset contains web snapshot records. Each record has six fields: the page content (dump), a timestamp (timestamp), the page URL (url), the WARC record ID (warc-record-id), a list of web page identifiers (ids), and the length of the page content (len). The data is split into a training set of 4,538,452 samples (22,378,798,572 bytes) and a validation set of 45,843 samples (224,902,251 bytes). The total dataset size is 22,603,700,823 bytes, and the download size is 4,810,909,433 bytes.
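The figures in the description can be cross-checked with a few lines of Python. The sketch below only does arithmetic on the numbers stated above (split sizes, total, download size). The helper `stream_split` is a hypothetical convenience wrapper, and the split names "train" and "validation" are assumptions inferred from the description, not confirmed against the repository. It relies on the Hugging Face `datasets` library, which is imported lazily so the arithmetic runs without it.

```python
# Figures copied from the dataset description.
REPO_ID = "raileymontalan/SEA-PILE-v2-tl-tokenized-stochastok0.1"
SPLITS = {
    "train": {"num_examples": 4_538_452, "num_bytes": 22_378_798_572},
    "validation": {"num_examples": 45_843, "num_bytes": 224_902_251},
}
STATED_TOTAL_BYTES = 22_603_700_823
STATED_DOWNLOAD_BYTES = 4_810_909_433

# Sum of the per-split sizes; should match the stated total.
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Average stored size of one sample in each split.
avg_bytes = {
    name: s["num_bytes"] / s["num_examples"] for name, s in SPLITS.items()
}

# Share of all samples held out for validation.
validation_fraction = SPLITS["validation"]["num_examples"] / total_examples

# Ratio of the stated dataset size to the stated download size.
size_ratio = STATED_TOTAL_BYTES / STATED_DOWNLOAD_BYTES


def stream_split(split="train"):
    """Hypothetical helper: stream one split without downloading it all.

    Assumes the `datasets` package is installed and that the split is
    named "train" or "validation" (an assumption, not verified here).
    """
    from datasets import load_dataset  # lazy import, optional dependency

    return load_dataset(REPO_ID, split=split, streaming=True)


print("sizes add up:", total_bytes == STATED_TOTAL_BYTES)
print("total samples:", total_examples)
for name, value in avg_bytes.items():
    print(f"average bytes per {name} sample: {value:.1f}")
print(f"validation share of samples: {validation_fraction:.2%}")
print(f"dataset size / download size: {size_ratio:.2f}")
```

Streaming is suggested here only because the full download is several gigabytes; iterating over `stream_split("validation")` would yield records one at a time, assuming the field names match the description.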
Provider:
raileymontalan



