synpre/dclm_seed_2b_tanishq
Source: Hugging Face. Updated 2025-01-08; indexed 2025-02-15.
Download link:
https://hf-mirror.com/datasets/synpre/dclm_seed_2b_tanishq
Description:
The dataset contains features of web page text: the n-gram count before deduplication, a FastText language ID, metadata (page size, type, date, and similar information), a previous word count, the text content, the URL, Warcinfo, and a FastText probability value. It is provided as a training split of 1,321,320 examples, about 8.69 GB in total.
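As a rough illustration of how the features described above might be used, the sketch below streams the training split with the Hugging Face `datasets` library and keeps only records whose FastText probability meets a threshold. The description does not give the exact column names, so the probability column is passed in as a parameter (`prob_key` is a hypothetical name), and both helper functions are illustrative rather than part of the dataset.

```python
from typing import Any, Dict, Iterable, Iterator


def stream_examples(repo_id: str, split: str = "train") -> Iterable[Dict[str, Any]]:
    """Lazily iterate over a Hugging Face dataset instead of downloading all of it."""
    # Imported here so the filtering helper below works without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that yields one dict per example.
    return load_dataset(repo_id, split=split, streaming=True)


def filter_by_probability(
    records: Iterable[Dict[str, Any]],
    prob_key: str,
    threshold: float,
) -> Iterator[Dict[str, Any]]:
    """Yield records whose value under prob_key is at least threshold.

    prob_key is an assumption: the listing mentions a FastText probability
    value but not the column name, so check the keys of one record first.
    Records that lack the key are skipped.
    """
    for record in records:
        value = record.get(prob_key)
        if value is not None and value >= threshold:
            yield record
```

A possible workflow is to call `stream_examples("synpre/dclm_seed_2b_tanishq")`, inspect the keys of the first record to find the real probability column, and then pass that name to `filter_by_probability`. Streaming avoids fetching the full 8.69 GB up front.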
Provider:
synpre



