EleutherAI/filtering-pretraining-mix

Name: EleutherAI/filtering-pretraining-mix
Creator: EleutherAI
Published: 2025-04-16 04:44:06
License: No description available

Source: Hugging Face
Updated: 2025-04-16
Indexed: 2025-05-31

Download link: https://hf-mirror.com/datasets/EleutherAI/filtering-pretraining-mix

Description:

The dataset contains text along with associated features, including fastText scores, a language label, and a language score. It has a single training split of approximately 409 million examples, and the total dataset size is listed as exceeding 2 PB. It is suited to NLP tasks such as language identification and text classification.
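As a rough illustration of how the per-row features described above might be used, the sketch below keeps only rows whose language label and scores pass some thresholds. The column names (`text`, `fasttext_score`, `language`, `language_score`), the threshold values, and the sample rows are all assumptions made for this example; the listing does not give the actual schema, so check the dataset's column names before adapting it.

```python
from typing import Any, Dict, Iterable, Iterator

# NOTE: these column names are hypothetical. The listing only says that each
# row carries text, a fastText score, a language label, and a language score.
Row = Dict[str, Any]


def filter_rows(
    rows: Iterable[Row],
    lang: str = "en",
    min_language_score: float = 0.8,
    min_fasttext_score: float = 0.5,
) -> Iterator[Row]:
    """Yield rows in the target language whose scores meet both thresholds."""
    for row in rows:
        if (
            row["language"] == lang
            and row["language_score"] >= min_language_score
            and row["fasttext_score"] >= min_fasttext_score
        ):
            yield row


# Made-up sample rows standing in for real dataset records.
sample_rows = [
    {"text": "A clear English paragraph.", "fasttext_score": 0.91,
     "language": "en", "language_score": 0.99},
    {"text": "buy now click here", "fasttext_score": 0.12,
     "language": "en", "language_score": 0.95},
    {"text": "Un texte en français.", "fasttext_score": 0.88,
     "language": "fr", "language_score": 0.98},
]

kept = list(filter_rows(sample_rows))
print([row["text"] for row in kept])
```

Because `filter_rows` accepts any iterable, it could in principle be fed rows from the Hugging Face `datasets` library, for example `load_dataset("EleutherAI/filtering-pretraining-mix", split="train", streaming=True)`, so that the full dataset never has to be downloaded at once. This assumes the real columns are renamed or mapped to the hypothetical names used here.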

Provider: EleutherAI
