emarro/fineweb_subset_train

Name: emarro/fineweb_subset_train
Creator: emarro
Published: 2026-04-27 20:36:18
License: No description available

Source: Hugging Face. Updated: 2026-04-27. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/emarro/fineweb_subset_train

Description:

This dataset is a text corpus of 873,358 training examples intended for natural language processing tasks. Each example has two features: the raw text (text) and its corresponding token ID sequence (input_ids). The input_ids are stored as lists of uint16, so the text has already been preprocessed and tokenized, and every token ID fits in the range 0 to 65,535. The dataset has a file size of about 12.5 GB and a download size of about 4.9 GB, and is suited to model training and text analysis.
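The figures in the description can be turned into a rough sanity check. The sketch below is illustrative only: it assumes the quoted sizes are decimal gigabytes, uses made-up token IDs rather than real records, and assumes the split is named "train". The helper names pack_token_ids and stream_examples are hypothetical. The streaming helper relies on the Hugging Face `datasets` package and network access, so it is defined but not called here.

```python
from array import array

# Figures quoted in the listing (sizes assumed to be decimal gigabytes).
NUM_EXAMPLES = 873_358
DATASET_BYTES = 12.5e9
DOWNLOAD_BYTES = 4.9e9

# Rough per-example footprint and compression ratio implied by those figures.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
compression_ratio = DATASET_BYTES / DOWNLOAD_BYTES

# uint16 can represent 0..65535, so every token ID must fit in that range.
UINT16_MAX = 2**16 - 1


def pack_token_ids(token_ids):
    """Pack token IDs into an unsigned 16-bit array, rejecting out-of-range IDs."""
    if any(t < 0 or t > UINT16_MAX for t in token_ids):
        raise ValueError("token ID does not fit in uint16")
    return array("H", token_ids)


def stream_examples(repo_id="emarro/fineweb_subset_train", split="train"):
    """Lazily iterate over records without downloading the whole dataset.

    Assumption: the split is called "train". Requires `datasets` and network access.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset(repo_id, split=split, streaming=True)


# Hypothetical record mirroring the two listed features (IDs are invented).
example = {"text": "an example sentence", "input_ids": pack_token_ids([101, 2023, 65535])}

print(f"average size per example: {avg_bytes_per_example / 1e3:.1f} kB")
print(f"on-disk to download ratio: {compression_ratio:.2f}x")
```

Streaming is used in the helper because the full download is several gigabytes; iterating lazily lets a reader inspect a few records first.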

Provider:

emarro
