felixZzz/dclm_1M

Source: Hugging Face. Updated 2025-10-20. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/felixZzz/dclm_1M
Resource description:
The dataset has eight features: bff_contained_ngram_count_before_dedupe (the BFF contained n-gram count before deduplication), language_id_whole_page_fasttext (the fastText language identification result for the whole page), metadata (web page metadata such as content length, content type, and WARC record information), previous_word_count (a word count; the name suggests it was recorded before a later processing step), text (the text content), url (the page URL), warcinfo (WARC info), and fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train_prob (a probability from a fastText bigram model; the name suggests OpenHermes and Reddit ELI5 data versus "rw v2"). The dataset has a single training split with 1,000,000 samples and a total size of 6,557,658,104 bytes (about 6.56 GB).
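As a rough illustration of how the fields described above might be used, the sketch below filters records by the fastText probability column and derives the average record size from the quoted totals. The column name and the two totals come from the description; the sample records, the 0.5 threshold, the assumption that the score is a float, and the helper names `filter_by_score` and `average_bytes_per_sample` are all hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator

# Column name and totals as quoted in the dataset description.
SCORE_FIELD = "fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train_prob"
NUM_SAMPLES = 1_000_000
TOTAL_BYTES = 6_557_658_104


def average_bytes_per_sample(total_bytes: int = TOTAL_BYTES,
                             num_samples: int = NUM_SAMPLES) -> float:
    """Average size of one sample, derived from the quoted totals."""
    return total_bytes / num_samples


def filter_by_score(records: Iterable[Dict[str, Any]],
                    threshold: float) -> Iterator[Dict[str, Any]]:
    """Yield records whose score is present and at least `threshold`.

    Assumes the score column holds a float; records without it are skipped.
    """
    for record in records:
        score = record.get(SCORE_FIELD)
        if score is not None and score >= threshold:
            yield record


# Made-up records for illustration only; they are not taken from the dataset.
sample_records = [
    {"url": "https://example.com/a", "text": "An explanatory answer.", SCORE_FIELD: 0.91},
    {"url": "https://example.com/b", "text": "Navigation links.", SCORE_FIELD: 0.03},
    {"url": "https://example.com/c", "text": "No score recorded."},
]

# The 0.5 threshold is an arbitrary choice for this sketch.
kept = list(filter_by_score(sample_records, threshold=0.5))
avg_bytes = average_bytes_per_sample()

print([record["url"] for record in kept])
print(f"{avg_bytes:.1f} bytes per sample on average")
```

The same filter should work on any iterable of dict-like rows, so it could be applied to rows streamed from the download link if the actual column types match these assumptions.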
Provider:
felixZzz