FineWeb-pro
Source: arXiv; indexed 2025-09-30
Download link:
https://huggingface.co/datasets/gair-prox/FineWeb-pro
Description:
An English-language dataset used as replay data during continual pre-training: its samples are mixed back into the training stream so that the model does not degrade while it learns from new data. It also serves as a base English corpus for the same purpose.
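The replay idea described above can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the function name `mix_with_replay`, the `replay_ratio` parameter, and its default of 0.25 are assumptions chosen for illustration only. It simply interleaves a new-data stream with a replay stream, drawing from the replay stream with a fixed probability at each step.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def mix_with_replay(
    new_data: Iterable[T],
    replay_data: Iterable[T],
    replay_ratio: float = 0.25,  # assumed value, not taken from the source
    seed: int = 0,
) -> Iterator[T]:
    """Interleave two streams, picking the replay stream with probability
    `replay_ratio` at each step. Stops as soon as the chosen stream is empty.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")

    rng = random.Random(seed)  # local RNG so the mix is reproducible
    new_iter = iter(new_data)
    replay_iter = iter(replay_data)

    while True:
        source = replay_iter if rng.random() < replay_ratio else new_iter
        try:
            yield next(source)
        except StopIteration:
            # One of the streams ran out; end the mixed stream cleanly.
            return
```

In practice `replay_data` would be an iterator over documents from this dataset (for example, one obtained by streaming it with the Hugging Face `datasets` library) and `new_data` would be the new-domain corpus; that wiring is omitted here, and a suitable ratio depends on the training setup.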
Provider:
Hugging Face



