Salesforce/fineweb_deduplicated
Hugging Face | Updated 2025-02-03 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/Salesforce/fineweb_deduplicated
Description:
FineWeb is a popular, high-quality open text dataset intended for training language models. It is produced by Hugging Face and is 93.4 TB in size, containing 15T tokens. Because 70% of the data is duplicated, deduplication reduces the dataset from 15T to 5T tokens, making it cheaper to process. The deduplication mechanism tokenizes the text with the GPT-4o tokenizer and deduplicates on the tokenized version. The dataset also provides an opportunity to study the effects of deduplication on massive datasets.
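The description says deduplication is performed on the tokenized text. A minimal sketch of that idea follows; it is an illustration under stated assumptions, not the pipeline actually used to build this dataset. The names `whitespace_tokenize`, `token_key`, and `deduplicate` are hypothetical, and the whitespace tokenizer is only a stand-in for the GPT-4o tokenizer so that the example runs with the standard library alone.

```python
import hashlib
from collections import Counter
from typing import Callable, Iterable, Sequence


def whitespace_tokenize(text: str) -> list:
    """Stand-in tokenizer (assumption): splits on whitespace.

    The dataset description names the GPT-4o tokenizer instead.
    """
    return text.split()


def token_key(tokens: Sequence) -> str:
    """Hash a token sequence into a fixed-size key for cheap comparison."""
    joined = "\x1f".join(str(t) for t in tokens)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(
    texts: Iterable[str],
    tokenize: Callable[[str], Sequence] = whitespace_tokenize,
):
    """Keep the first text for each distinct token sequence.

    Returns the kept texts and a Counter of how often each key occurred.
    """
    counts = Counter()
    kept = []
    for text in texts:
        key = token_key(tokenize(text))
        if counts[key] == 0:
            kept.append(text)  # first time this token sequence is seen
        counts[key] += 1
    return kept, counts


docs = ["the cat sat", "a dog ran", "the cat sat", "the cat sat"]
kept, counts = deduplicate(docs)
removed_fraction = 1 - len(kept) / len(docs)
print(kept, removed_fraction)
```

To approximate the described setup more closely, one could pass a real tokenizer as `tokenize`, for example the `encode` method of `tiktoken.get_encoding("o200k_base")`, assuming the `tiktoken` package is installed. Holding hashes in memory is a simplification that would not scale to 15T tokens without sharding or an external store.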
Provider:
Salesforce



