Salesforce/fineweb_deduplicated

Name: Salesforce/fineweb_deduplicated
Creator: Salesforce
Published: 2025-02-03 17:14:10
License: not specified

Source: Hugging Face | Updated: 2025-02-03 | Indexed: 2024-12-14

Download link:

https://hf-mirror.com/datasets/Salesforce/fineweb_deduplicated

Description:

FineWeb is a popular, high-quality open text dataset intended for training language models. It is produced by Hugging Face and is 93.4 TB in size, with 15T tokens. Because about 70% of the data is duplicated, deduplication shrinks the dataset from 15T to 5T tokens, making it cheaper to process. The deduplication mechanism tokenizes the text with the GPT-4o tokenizer and deduplicates the tokenized version. This dataset provides an opportunity to study the effects of deduplication on massive datasets.
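
The tokenize-then-deduplicate idea described above can be illustrated with a minimal sketch. This is not the pipeline used to build the dataset: the function names (`whitespace_tokenize`, `token_fingerprint`, `dedup_on_tokens`) are hypothetical, exact matching of full token sequences is an assumption, and a simple whitespace tokenizer stands in for the GPT-4o tokenizer so that the example runs with the standard library only.

```python
import hashlib
from typing import Callable, Iterable, List


def whitespace_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one stable integer id per word.

    To approximate the GPT-4o tokenizer instead, one could pass
    tiktoken.get_encoding("o200k_base").encode as the `tokenize` argument.
    """
    return [
        int.from_bytes(hashlib.sha1(word.encode("utf-8")).digest()[:4], "big")
        for word in text.split()
    ]


def token_fingerprint(token_ids: Iterable[int]) -> bytes:
    """Hash a token-id sequence into a fixed-size key."""
    digest = hashlib.sha256()
    for token_id in token_ids:
        digest.update(token_id.to_bytes(8, "big"))
    return digest.digest()


def dedup_on_tokens(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]] = whitespace_tokenize,
) -> List[str]:
    """Keep the first text seen for each distinct token sequence."""
    seen = set()
    kept = []
    for text in texts:
        key = token_fingerprint(tokenize(text))
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


docs = ["hello world", "hello   world", "another doc", "hello world"]
unique_docs = dedup_on_tokens(docs)
```

Comparing fingerprints of token sequences rather than raw strings means two texts count as duplicates whenever they tokenize identically. With the stand-in tokenizer, that makes "hello world" and "hello   world" collapse into one entry; a real subword tokenizer would draw that line differently.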

Provided by:

Salesforce
