sanranjan/fineweb-CC-MAIN-2024-10-1B-en

Name: sanranjan/fineweb-CC-MAIN-2024-10-1B-en
Creator: sanranjan
Published: 2024-09-15 18:04:39
License: No description available

Hugging Face | Updated 2024-09-15 | Indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/sanranjan/fineweb-CC-MAIN-2024-10-1B-en

Resource description:

This dataset includes the following fields: text, ID, dump, URL, date, file path, language, language score, and token count. It stores web text together with its metadata and could be used for natural language processing tasks such as text classification, sentiment analysis, or language model training. The dataset contains a single training split with 1,500,000 samples and a total size of 5,554,137,895 bytes (about 5.55 GB).
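As a rough illustration of the record layout and size figures above, the following Python sketch models one sample and derives the average size per sample. The column names (`id`, `file_path`, `language_score`, `token_count`, and so on) are assumptions inferred from the field list and have not been checked against the dataset's actual schema. `FineWebRecord`, `average_bytes_per_sample`, and `missing_fields` are hypothetical helper names introduced only for this example.

```python
from typing import List, TypedDict


class FineWebRecord(TypedDict):
    """One sample, with column names ASSUMED from the field list above."""
    text: str
    id: str
    dump: str
    url: str
    date: str
    file_path: str
    language: str
    language_score: float
    token_count: int


# Figures quoted in the description.
NUM_SAMPLES = 1_500_000
TOTAL_BYTES = 5_554_137_895


def average_bytes_per_sample(total_bytes: int, num_samples: int) -> float:
    """Return the mean size of one sample in bytes."""
    return total_bytes / num_samples


def missing_fields(record: dict) -> List[str]:
    """List the assumed columns that are absent from a record."""
    return [name for name in FineWebRecord.__annotations__ if name not in record]


if __name__ == "__main__":
    avg = average_bytes_per_sample(TOTAL_BYTES, NUM_SAMPLES)
    print(f"average size per sample: {avg:.1f} bytes")
    print(missing_fields({"text": "hello", "url": "https://example.com"}))
```

On these figures, the average works out to a little over 3.7 KB per sample. To read the actual data, the usual route would be `load_dataset("sanranjan/fineweb-CC-MAIN-2024-10-1B-en", split="train", streaming=True)` from the Hugging Face `datasets` library, assuming the repository is still publicly available. The real column names should be checked against the first record returned.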

Provided by:

sanranjan
