semran1/DCLM-17M

Name: semran1/DCLM-17M
Creator: semran1
Published: 2025-01-14 06:54:21
License: No description provided

Hugging Face: updated 2025-01-14; indexed 2025-02-15

Download link:

https://hf-mirror.com/datasets/semran1/DCLM-17M

Description:

This dataset contains web page text along with related metadata. It supports fastText model training and provides, for each example, the n-gram count before deduplication, a language ID, a previous word count, the text content, and the URL, among other fields. It has a single training split of approximately 17 million examples, and the data size exceeds 100GB.
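
Because the corpus exceeds 100GB, it can be useful to inspect a few examples without downloading everything. The following is a minimal sketch, assuming the repository `semran1/DCLM-17M` can be read with the Hugging Face `datasets` library using its default configuration and a `train` split. The helper names `preview` and `stream_train_split` are hypothetical and not part of the dataset or any library, and the exact field names in each record are not confirmed by this listing.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def preview(records: Iterable[Dict[str, Any]], n: int = 3, width: int = 80) -> List[Dict[str, Any]]:
    """Return the first n records, truncating long string values to `width` characters."""
    out = []
    for rec in islice(records, n):
        out.append({k: (v[:width] if isinstance(v, str) else v) for k, v in rec.items()})
    return out


def stream_train_split(repo_id: str = "semran1/DCLM-17M"):
    """Open the train split lazily (assumption: default config, split named 'train')."""
    # Imported here so preview() stays usable without the `datasets` package installed.
    from datasets import load_dataset

    # streaming=True yields examples on demand instead of downloading the full corpus.
    return load_dataset(repo_id, split="train", streaming=True)


# Example usage (requires network access and `pip install datasets`):
# for row in preview(stream_train_split()):
#     print(row)
```

Streaming is chosen here only to avoid a full download; for repeated training runs a local copy may be more practical.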

Provider:

semran1
