felixZzz/dclm_1M

Name: felixZzz/dclm_1M
Creator: felixZzz
Published: 2025-10-20 04:40:04
License: Not specified

Source: Hugging Face. Updated: 2025-10-20. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/felixZzz/dclm_1M

Description:

The dataset includes the following features: bff_contained_ngram_count_before_dedupe (the BFF contained n-gram count before deduplication), language_id_whole_page_fasttext (the language ID of the whole page, as predicted by fastText), metadata (web page metadata such as content length, content type, and WARC record information), previous_word_count (the previous word count), text (the text content), url (the web address), warcinfo (WARC information), and fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train_prob (the probability assigned by the fastText classifier that the field name identifies: OpenHermes and Reddit ELI5 versus RW v2, bigram, 200k training examples). The dataset has a single training split with 1,000,000 samples and a total size of 6,557,658,104 bytes (about 6.56 GB).
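As a rough illustration of the schema described above, the following Python sketch lists the eight feature names, checks a record for missing fields, and derives the average sample size from the stated totals. The helper names (`average_bytes_per_sample`, `missing_columns`, `stream_first_records`) are hypothetical and introduced only for this example. The streaming loader assumes the dataset is reachable on the Hugging Face Hub under the repository id taken from the download link, that the split is named `train`, and that the `datasets` library is installed; none of this is confirmed by the listing itself.

```python
from itertools import islice

# Feature names exactly as given in the description.
EXPECTED_COLUMNS = [
    "bff_contained_ngram_count_before_dedupe",
    "language_id_whole_page_fasttext",
    "metadata",
    "previous_word_count",
    "text",
    "url",
    "warcinfo",
    "fasttext_openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train_prob",
]

# Totals stated in the description.
NUM_SAMPLES = 1_000_000
TOTAL_BYTES = 6_557_658_104


def average_bytes_per_sample(total_bytes=TOTAL_BYTES, num_samples=NUM_SAMPLES):
    """Return the mean size of one sample in bytes."""
    return total_bytes / num_samples


def missing_columns(record):
    """Return the expected feature names that are absent from a record (a dict)."""
    return [name for name in EXPECTED_COLUMNS if name not in record]


def stream_first_records(n=3):
    """Stream the first n records without downloading the full dataset.

    Assumption: the repository id and split name below are inferred from the
    listing; they are not verified here.
    """
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    ds = load_dataset("felixZzz/dclm_1M", split="train", streaming=True)
    return list(islice(ds, n))


if __name__ == "__main__":
    print(f"Average sample size: {average_bytes_per_sample():.1f} bytes")
```

With the stated totals, the average works out to roughly 6.56 KB per sample. Streaming is used in the loader sketch because the full dataset is several gigabytes, so inspecting a handful of records first is usually cheaper than a full download.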

Provided by:

felixZzz
