KrisMinchev/finemath-4plus-tokenized-0p86
Source: Hugging Face. Updated 2025-12-18; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/KrisMinchev/finemath-4plus-tokenized-0p86
Description:
This is a web-crawled dataset with multiple feature fields, intended primarily for natural language processing tasks. Its fields are URL, fetch time, content MIME type, WARC filename, text content, length, character count, metadata, score, integer score, crawl source, snapshot type, language, language score, input IDs, and attention mask. The dataset contains only a training split of 5,761,563 examples, with a dataset size of 74,838,299,422 bytes and a download size of 27,534,538,196 bytes.
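The sizes quoted above are easier to judge as per-example and ratio figures. The following is a minimal sketch, not part of the original listing: the constants are copied from the description, `size_summary` and `peek` are hypothetical helper names, and `peek` assumes the Hugging Face `datasets` package is installed and that the dataset is reachable under the repository ID shown in the download link.

```python
# Figures quoted in the description above.
DATASET_ID = "KrisMinchev/finemath-4plus-tokenized-0p86"
DATASET_BYTES = 74_838_299_422
DOWNLOAD_BYTES = 27_534_538_196
NUM_EXAMPLES = 5_761_563


def size_summary():
    """Derive rough per-example and download-ratio figures from the listed sizes."""
    return {
        "dataset_gb": DATASET_BYTES / 1e9,        # decimal gigabytes
        "download_gb": DOWNLOAD_BYTES / 1e9,
        "bytes_per_example": DATASET_BYTES / NUM_EXAMPLES,
        "download_ratio": DOWNLOAD_BYTES / DATASET_BYTES,
    }


def peek(n=1):
    """Stream the first n training examples without downloading the full set.

    Assumption: the `datasets` package is installed and the Hub is reachable.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset(DATASET_ID, split="train", streaming=True)
    return list(stream.take(n))


if __name__ == "__main__":
    for key, value in size_summary().items():
        print(f"{key}: {value:,.2f}")
```

Streaming is used in `peek` because the full download is roughly 27.5 GB; inspecting a handful of rows first is usually enough to confirm the field layout before committing to a full download.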
Provider:
KrisMinchev



